We are Guy Halevi and Orr Jungerman, final-year students with a strong interest in data science. We chose this workshop to extend our knowledge and practice core data science principles. It's worth noting that we both live in north-central Tel Aviv (the Old North neighborhood), a district notorious for its lack of parking. When we were tasked with choosing a topic for this project, we wanted to tackle something important and directly useful to us, so predicting parking availability was one of the first ideas that came to mind. Luckily, we had already been harvesting data about a few parking lots for quite some time.
Our experience with parking lots in Tel Aviv led us to believe that there are recurring patterns. For example, on weekdays most parking lots empty out in the morning and fill up in the evening as people commute to and from work. On weekends, by contrast, the lots tend to empty much later, since most residents are not working and many drive out of the city to visit family. To back our experience with real, objective data, we have been collecting the vacancy status of popular parking lots at one-minute resolution for the past 2 years. We believe we can use this data to accurately predict the status of each parking lot, helping us plan our schedules ahead of time. We expect to find interesting anomalies due to holidays or special events (such as lockdowns, rain, etc.). To deal with such anomalies, we will enrich the data with additional information, such as recurring holidays.
The data we collected comes in CSV format, where each row contains the vacancy status of 6 different parking lots at a specific time. The parking lots are: Basel, Asuta, Sheraton, Dan, Dubnov and Habima. There are 6 possible vacancy statuses: Free - a significant number of spots, Few - the parking lot is almost full, Full - no spots at all, Active - no indication at that time, Unknown/NaN - we had trouble collecting the status. We have been collecting data since July 2019, at one-minute resolution, by writing a simple script that parses the website of each parking lot, for example http://www.ahuzot.co.il/Parking/ParkingDetails/?ID=3.
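A minimal sketch of such a scraper's parsing step (the HTML snippet and status keywords here are our assumptions for illustration; the real Ahuzot pages encode the status differently, e.g. via Hebrew status images):

```python
import re

# Hypothetical status keywords, ordered so that more specific matches win.
STATUSES = ['Free', 'Few', 'Full', 'Active']

def parse_status(html: str) -> str:
    """Extract the first recognized vacancy status from a lot page's HTML."""
    for status in STATUSES:
        if re.search(status, html, flags=re.IGNORECASE):
            return status
    return 'Unknown'

# In the real scraper, `html` would come from fetching
# http://www.ahuzot.co.il/Parking/ParkingDetails/?ID=<lot_id> once a minute.
sample_html = '<img src="/pics/ParkingIcons/status.png" alt="Full">'
print(parse_status(sample_html))  # Full
```

Any page where no keyword is found falls through to the Unknown status, which is exactly the Unknown/NaN class described above.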
The following map shows the locations of the 6 lots:

import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime, date, timedelta
import matplotlib.pyplot as plt
from ipywidgets import HTML
from more_itertools import first
from collections import Counter
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from tqdm.auto import tqdm
import tensorflow as tf
tqdm.pandas()
np.random.seed(0)
RANDOM_STATE=0
pd.options.display.max_rows = 10
LOT_NAMES = ['Basel', 'Asuta', 'Sheraton', 'Dan', 'Dubnov', 'Habima']
DATA_CSV_FILENAME = 'june21_parks.csv'
Let's start by viewing the raw data.
We will use pandas to load the CSV into a dataframe:
raw_df = pd.read_csv(DATA_CSV_FILENAME)
raw_df
| Date | Day | Hour | Minute | Basel | Asuta | Sheraton | Dan | Dubnov | Habima | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13/07/2019 | Sat | 20 | 2 | Few | Full | Full | Full | NaN | NaN |
| 1 | 13/07/2019 | Sat | 20 | 3 | Few | Full | Full | Full | NaN | NaN |
| 2 | 13/07/2019 | Sat | 20 | 4 | Few | Full | Full | Full | NaN | NaN |
| 3 | 13/07/2019 | Sat | 20 | 5 | Few | Full | Full | Full | NaN | NaN |
| 4 | 13/07/2019 | Sat | 20 | 6 | Full | Full | Full | Full | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 941762 | 21/06/2021 | Mon | 0 | 2 | Full | Full | Free | Full | Free | Free |
| 941763 | 21/06/2021 | Mon | 0 | 3 | Full | Full | Free | Full | Free | Free |
| 941764 | 21/06/2021 | Mon | 0 | 4 | Full | Full | Free | Full | Free | Free |
| 941765 | 21/06/2021 | Mon | 0 | 5 | Full | Full | Free | Full | Free | Free |
| 941766 | 21/06/2021 | Mon | 0 | 6 | Full | Full | Free | Full | Free | Free |
941767 rows × 10 columns
First, we will copy our raw data, so we can modify the copy without altering the original.
Let's rename the columns to more standard and convenient names, and set the right type for each column so we can use them later.
full_df = raw_df.copy()
full_df = full_df.rename(columns={
'Day': 'day',
'Hour': 'hour',
'Minute': 'minute',
'Date': 'date'
})
full_df['datetime'] = full_df['date'] + " " + full_df["hour"].astype(str) + ":" + full_df["minute"].astype(str)
full_df['datetime'] = pd.to_datetime(full_df['datetime'], format='%d/%m/%Y %H:%M')
full_df['datetime'] = full_df['datetime'].dt.tz_localize('Asia/Jerusalem', 'infer')
full_df.pop('date')
full_df.head()
| day | hour | minute | Basel | Asuta | Sheraton | Dan | Dubnov | Habima | datetime | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sat | 20 | 2 | Few | Full | Full | Full | NaN | NaN | 2019-07-13 20:02:00+03:00 |
| 1 | Sat | 20 | 3 | Few | Full | Full | Full | NaN | NaN | 2019-07-13 20:03:00+03:00 |
| 2 | Sat | 20 | 4 | Few | Full | Full | Full | NaN | NaN | 2019-07-13 20:04:00+03:00 |
| 3 | Sat | 20 | 5 | Few | Full | Full | Full | NaN | NaN | 2019-07-13 20:05:00+03:00 |
| 4 | Sat | 20 | 6 | Full | Full | Full | Full | NaN | NaN | 2019-07-13 20:06:00+03:00 |
As you can see, the datetime column is now recognized as a datetime object instead of a plain string.
Now, let's explore the "parking statuses", or classes in our terminology:
full_df[LOT_NAMES].apply(pd.value_counts)
| Basel | Asuta | Sheraton | Dan | Dubnov | Habima | |
|---|---|---|---|---|---|---|
| Active | 3858 | 3858 | 3858 | 3858 | 3859 | 3858 |
| Few | 293512 | 176866 | 207668 | 140452 | 129348 | 22980 |
| Free | 382716 | 462632 | 501342 | 596460 | 693716 | 829129 |
| Full | 243724 | 280415 | 210585 | 183649 | 96599 | 67557 |
| Unknown | 17957 | 17996 | 18314 | 17348 | 17351 | 17349 |
We can see that in addition to the 3 main statuses <Free, Few, Full>, there are 3 additional "unknown" statuses - <NaN, Active, Unknown> - which are quite rare and which we want to exclude from our dataset.
Besides excluding these 3 "unknown" statuses, we need to decide how to handle the 3 "main" ones.
The meaning of Few is that there are some parking spots available, but only a few. In some cases, had we checked the status a few seconds earlier or later than we actually did, we would have gotten a different result. So, it makes sense to smooth the data. For example, if the status was Full for a while, then Few for a few minutes, and then Full again, it makes more sense to translate that Few to Full than to Free, and vice versa.
To do that, we start by treating Few as "the middle between Free and Full"; in other words, we translate Free -> 1, Few -> 0.5, Full -> 0, since Free means plenty of free spots (1), Few means only a few (0.5), and Full means none (0).
Later we will normalize each 0.5 point to either 0 or 1, depending on its surroundings.
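A tiny illustration of this smoothing idea on a toy series (the values and window size are made up for illustration; the real pipeline below uses a Gaussian-weighted hourly window):

```python
import pandas as pd

# A lone Few (0.5) sandwiched between Fulls (0), followed by a run of Frees (1)
toy = pd.Series([0, 0, 0.5, 0, 0, 1, 1, 1, 1], dtype=float)

# Centered moving average over a 5-sample window
smoothed = toy.rolling(window=5, center=True, min_periods=1).mean()

# Snap back to hard classes with a 0.5 threshold, as we do later
snapped = (smoothed >= 0.5).astype(int)
print(snapped.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The lone Few averages close to 0 and snaps back to Full, while the point at the Full-to-Free transition averages exactly 0.5 and, per our `>= 0.5` rule, becomes Free.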
for lot in LOT_NAMES:
full_df[lot] = full_df[lot].replace({
'Unknown': np.nan,
'Active': np.nan,
'Free': 1,
'Few': 0.5,
'Full': 0,
}, inplace=False)
print('Full dataset:')
display(full_df)
print('Classes:')
display(full_df[LOT_NAMES].apply(pd.value_counts))
Full dataset:
| day | hour | minute | Basel | Asuta | Sheraton | Dan | Dubnov | Habima | datetime | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sat | 20 | 2 | 0.5 | 0.0 | 0.0 | 0.0 | NaN | NaN | 2019-07-13 20:02:00+03:00 |
| 1 | Sat | 20 | 3 | 0.5 | 0.0 | 0.0 | 0.0 | NaN | NaN | 2019-07-13 20:03:00+03:00 |
| 2 | Sat | 20 | 4 | 0.5 | 0.0 | 0.0 | 0.0 | NaN | NaN | 2019-07-13 20:04:00+03:00 |
| 3 | Sat | 20 | 5 | 0.5 | 0.0 | 0.0 | 0.0 | NaN | NaN | 2019-07-13 20:05:00+03:00 |
| 4 | Sat | 20 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 2019-07-13 20:06:00+03:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 941762 | Mon | 0 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2021-06-21 00:02:00+03:00 |
| 941763 | Mon | 0 | 3 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2021-06-21 00:03:00+03:00 |
| 941764 | Mon | 0 | 4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2021-06-21 00:04:00+03:00 |
| 941765 | Mon | 0 | 5 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2021-06-21 00:05:00+03:00 |
| 941766 | Mon | 0 | 6 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2021-06-21 00:06:00+03:00 |
941767 rows × 10 columns
Classes:
| Basel | Asuta | Sheraton | Dan | Dubnov | Habima | |
|---|---|---|---|---|---|---|
| 0.0 | 243724 | 280415 | 210585 | 183649 | 96599 | 67557 |
| 0.5 | 293512 | 176866 | 207668 | 140452 | 129348 | 22980 |
| 1.0 | 382716 | 462632 | 501342 | 596460 | 693716 | 829129 |
In order to work with the data, it would be much more convenient if each row had only one class.
In addition, we will remove rows whose class is NaN so that our data is clean. We didn't do this earlier, to avoid dropping an entire row because of a single missing value at one parking lot.
We use melt() to do that:
melted_df = (
full_df
.melt(
id_vars=[
'datetime',
'day',
'hour',
'minute',
],
var_name='lot',
value_name='class'
)
.dropna()
)
melted_df
| datetime | day | hour | minute | lot | class | |
|---|---|---|---|---|---|---|
| 0 | 2019-07-13 20:02:00+03:00 | Sat | 20 | 2 | Basel | 0.5 |
| 1 | 2019-07-13 20:03:00+03:00 | Sat | 20 | 3 | Basel | 0.5 |
| 2 | 2019-07-13 20:04:00+03:00 | Sat | 20 | 4 | Basel | 0.5 |
| 3 | 2019-07-13 20:05:00+03:00 | Sat | 20 | 5 | Basel | 0.5 |
| 4 | 2019-07-13 20:06:00+03:00 | Sat | 20 | 6 | Basel | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 5650597 | 2021-06-21 00:02:00+03:00 | Mon | 0 | 2 | Habima | 1.0 |
| 5650598 | 2021-06-21 00:03:00+03:00 | Mon | 0 | 3 | Habima | 1.0 |
| 5650599 | 2021-06-21 00:04:00+03:00 | Mon | 0 | 4 | Habima | 1.0 |
| 5650600 | 2021-06-21 00:05:00+03:00 | Mon | 0 | 5 | Habima | 1.0 |
| 5650601 | 2021-06-21 00:06:00+03:00 | Mon | 0 | 6 | Habima | 1.0 |
5519350 rows × 6 columns
Our dataset is incomplete - sometimes the scraper didn't run, and sometimes there were problems with the API. However, we can recover some of the lost data by resampling and applying a moving average. This approach will also help us reduce noise and small anomalies.
With the moving average, the discrete availability class becomes a continuous variable: the likelihood that parking was available at that point in time.
# Upsample to 1m
resampled_df = melted_df.set_index('datetime').groupby('lot').resample('60s').first()
# Drop `lot` column - upsampled datapoints are set to na, including lot
resampled_df = resampled_df.drop(columns=['lot'])
# Ungroup, so all the samples get the proper lot and datetime values
resampled_df = resampled_df.reset_index()
# Group upsampled data by lot
avg_df = resampled_df.groupby(by=['lot'])
# Apply moving average on upsampled data
avg_df = avg_df.rolling('1h', on='datetime', min_periods=5, center=True, win_type='gaussian')['class'].mean(std=0.5)
# Drop data points whose availability we could not determine
avg_df = avg_df.dropna().reset_index()
# Rename class column to availability
avg_df = avg_df.rename(columns={"class": "availability"})
print(f"number of original samples: {len(melted_df)}")
print(f"number of resampled samples: {len(avg_df)}")
print(f"rectified: {len(avg_df) - len(melted_df)} samples")
avg_df
number of original samples: 5519350 number of resampled samples: 5536426 rectified: 17076 samples
| lot | datetime | availability | |
|---|---|---|---|
| 0 | Asuta | 2019-07-13 20:02:00+03:00 | 0.0 |
| 1 | Asuta | 2019-07-13 20:03:00+03:00 | 0.0 |
| 2 | Asuta | 2019-07-13 20:04:00+03:00 | 0.0 |
| 3 | Asuta | 2019-07-13 20:05:00+03:00 | 0.0 |
| 4 | Asuta | 2019-07-13 20:06:00+03:00 | 0.0 |
| ... | ... | ... | ... |
| 5536421 | Sheraton | 2021-06-21 00:02:00+03:00 | 1.0 |
| 5536422 | Sheraton | 2021-06-21 00:03:00+03:00 | 1.0 |
| 5536423 | Sheraton | 2021-06-21 00:04:00+03:00 | 1.0 |
| 5536424 | Sheraton | 2021-06-21 00:05:00+03:00 | 1.0 |
| 5536425 | Sheraton | 2021-06-21 00:06:00+03:00 | 1.0 |
5536426 rows × 3 columns
After applying the moving average to obtain the availability feature, we'd like to define a threshold for parking status classification.
Since the Few status was translated to 0.5, we believe it should act as the threshold number too.
Every hour whose average is lower than 0.5 was Full most of the time and will be normalized to 0, and every hour whose average is higher than 0.5 will be normalized to 1.
For hours with an average of exactly 0.5, we decided to treat them as Free (i.e. 1), as it means that for a whole hour, most of the time there were "at least a few spots available".
For computational reasons and to simplify the data analysis, we decided to downsample the dataset to one-hour resolution.
avg_df['class'] = (avg_df['availability'] >= 0.5).astype(int)
avg_df
| lot | datetime | availability | class | |
|---|---|---|---|---|
| 0 | Asuta | 2019-07-13 20:02:00+03:00 | 0.0 | 0 |
| 1 | Asuta | 2019-07-13 20:03:00+03:00 | 0.0 | 0 |
| 2 | Asuta | 2019-07-13 20:04:00+03:00 | 0.0 | 0 |
| 3 | Asuta | 2019-07-13 20:05:00+03:00 | 0.0 | 0 |
| 4 | Asuta | 2019-07-13 20:06:00+03:00 | 0.0 | 0 |
| ... | ... | ... | ... | ... |
| 5536421 | Sheraton | 2021-06-21 00:02:00+03:00 | 1.0 | 1 |
| 5536422 | Sheraton | 2021-06-21 00:03:00+03:00 | 1.0 | 1 |
| 5536423 | Sheraton | 2021-06-21 00:04:00+03:00 | 1.0 | 1 |
| 5536424 | Sheraton | 2021-06-21 00:05:00+03:00 | 1.0 | 1 |
| 5536425 | Sheraton | 2021-06-21 00:06:00+03:00 | 1.0 | 1 |
5536426 rows × 4 columns
hourly_df = avg_df[avg_df['datetime'].dt.minute == 0].copy()
hourly_df['datehour'] = hourly_df['datetime'].dt.date.astype(str) + " - " + hourly_df['datetime'].dt.hour.astype(str).str.zfill(2)
hourly_df = hourly_df.set_index('datehour')
hourly_df['date'] = hourly_df['datetime'].dt.normalize()
hourly_df
| lot | datetime | availability | class | date | |
|---|---|---|---|---|---|
| datehour | |||||
| 2019-07-13 - 21 | Asuta | 2019-07-13 21:00:00+03:00 | 0.000000 | 0 | 2019-07-13 00:00:00+03:00 |
| 2019-07-13 - 22 | Asuta | 2019-07-13 22:00:00+03:00 | 0.033333 | 0 | 2019-07-13 00:00:00+03:00 |
| 2019-07-13 - 23 | Asuta | 2019-07-13 23:00:00+03:00 | 0.033333 | 0 | 2019-07-13 00:00:00+03:00 |
| 2019-07-14 - 00 | Asuta | 2019-07-14 00:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 |
| 2019-07-14 - 01 | Asuta | 2019-07-14 01:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 |
| ... | ... | ... | ... | ... | ... |
| 2021-06-20 - 20 | Sheraton | 2021-06-20 20:00:00+03:00 | 0.216667 | 0 | 2021-06-20 00:00:00+03:00 |
| 2021-06-20 - 21 | Sheraton | 2021-06-20 21:00:00+03:00 | 0.275000 | 0 | 2021-06-20 00:00:00+03:00 |
| 2021-06-20 - 22 | Sheraton | 2021-06-20 22:00:00+03:00 | 0.158333 | 0 | 2021-06-20 00:00:00+03:00 |
| 2021-06-20 - 23 | Sheraton | 2021-06-20 23:00:00+03:00 | 0.716667 | 1 | 2021-06-20 00:00:00+03:00 |
| 2021-06-21 - 00 | Sheraton | 2021-06-21 00:00:00+03:00 | 1.000000 | 1 | 2021-06-21 00:00:00+03:00 |
92280 rows × 5 columns
Now comes the part where we visualize our data. The visualizations should give us a grasp of the data and help validate our hypotheses.
Let's start by visualizing the class distribution among the different lots.
We will use a color palette that gives us the right intuition - 1 is "good" (there are free parking spots!), therefore green, and 0 is "bad", therefore red.
GREEN_RED_PALETTE = {1.0: 'green', 0.0: 'red'}
plt.rcParams["figure.figsize"] = (15,5)
_ = sns.histplot(
data=hourly_df,
x='lot',
hue='class',
multiple='dodge',
palette=GREEN_RED_PALETTE,
shrink=0.7,
).set_title('Class Distribution Per Parking Lot')
As the histogram shows, some lots have relatively equally distributed classes (e.g. Basel), while some have almost only one class (e.g. Habima).
We can also see that Free is the most common class in all lots.
This is already interesting information for us personally, as Basel is the lot closest to where we both live.
We'd like to mention how difficult it is to visualize this data in a way that is sensible and intuitive. We tried many approaches, but they all showed too much information to be helpful.
For example, let's show classes over time, for one specific month - January 2020:
plt.rcParams["figure.figsize"] = (15,5)
sns.lineplot(
marker='o',
x='date',
y='class',
hue='lot',
data=hourly_df[(hourly_df['datetime'].dt.year == 2020) & (hourly_df['datetime'].dt.month == 1)],
)
for item in plt.xticks()[1]:
item.set_rotation(90)
As you can see, even after filtering out most of the data and keeping only one month (out of the ~2 years we have!), there is still too much to visualize without special treatment.
Then we came up with the idea of weekly patterns. As mentioned before, we believe the parking vacancy status recurs every week. Let's transform our data to validate this hypothesis.
From our own knowledge and experience, parking availability changes over the week in a periodic way that is very similar across weeks. For example, every weekday morning there is quite a lot of parking, and every evening there isn't. On weekends it changes: Friday morning is busy, but Friday evening is always wide open (probably because many young residents drive out to visit their parents outside of Tel Aviv).
It's a very interesting hypothesis to start with.
Let's use FFT (Fast Fourier Transform) to prove that these patterns actually exist:
fft_df = hourly_df.copy().reset_index().set_index('datetime')['class']
fft_df = fft_df.resample('1h').first()
# Fill gaps with random noise; white noise spreads evenly across the spectrum, so the FFT peaks are unaffected
fft_df = fft_df.apply(lambda l: l if not np.isnan(l) else np.random.random())
fft = tf.signal.rfft(fft_df)
freqs = np.arange(0, len(fft))
total_hours = len(fft_df)
hours_in_week = 7 * 24
steps = freqs / total_hours * hours_in_week
plt.step(steps, np.abs(fft))
plt.xscale('log')
plt.ylim(0, 3000)
plt.xlim([0.2, 7 * 24]) # Limits to about 1/month - 1/hour
_ = plt.xticks([14, 7, 1], labels=['1 / 12-hours', '1 / day', '1 / week'])
The FFT diagram clearly shows a significant recurring daily pattern (meaning the time of day is strongly correlated with parking availability) and a smaller, yet still visible and significant, weekly spike (meaning the day of the week is also correlated with the status).
We expected the weekly pattern to be more prominent; we assume it is weaker than expected because the 5 weekdays are quite similar to one another, and only the weekends differ.
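As a sanity check of the method itself (on synthetic data, not our dataset), a pure 24-hour pattern sampled hourly produces a single spike at 7 cycles per week on the same axis. We use numpy's `rfft` here, rather than `tf.signal.rfft`, so the snippet stands alone:

```python
import numpy as np

hours_in_week = 7 * 24
n_weeks = 8
t = np.arange(n_weeks * hours_in_week)

# Synthetic availability with a pure daily (24-hour) period
signal = 0.5 + 0.5 * np.cos(2 * np.pi * t / 24)

fft = np.fft.rfft(signal)
freqs = np.arange(len(fft)) / len(t) * hours_in_week  # cycles per week

# Skip the DC component at frequency 0 and find the dominant frequency
peak = freqs[1:][np.argmax(np.abs(fft[1:]))]
print(peak)  # 7.0 -> one cycle per day
```

A spike at 7 on this axis is exactly the "1 / day" tick in our plot above, which is what the real data's dominant spike corresponds to.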
Let's group the data by weeks (each week defined as 00:00 on Sunday through 23:00 on Saturday). To simplify the data, we drop the few rows that precede the first whole week we have data for:
FIRST_DAY_OF_FIRST_WHOLE_WEEK = pd.to_datetime(date(year=2019, month=7, day=14)).tz_localize('Asia/Jerusalem')
hourly_df = hourly_df[hourly_df['date'] >= FIRST_DAY_OF_FIRST_WHOLE_WEEK].copy()
hourly_df['week_first_day'] = (
((hourly_df['date'] - FIRST_DAY_OF_FIRST_WHOLE_WEEK).dt.days / 7)
.astype(int)
.apply(lambda week_idx: FIRST_DAY_OF_FIRST_WHOLE_WEEK + pd.to_timedelta(timedelta(weeks=week_idx)))
)
hourly_df['day_hour'] = hourly_df['date'].dt.day_name() + '-' + hourly_df['datetime'].dt.hour.astype(str).str.zfill(2)
hourly_df
| lot | datetime | availability | class | date | week_first_day | day_hour | |
|---|---|---|---|---|---|---|---|
| datehour | |||||||
| 2019-07-14 - 00 | Asuta | 2019-07-14 00:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-00 |
| 2019-07-14 - 01 | Asuta | 2019-07-14 01:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-01 |
| 2019-07-14 - 02 | Asuta | 2019-07-14 02:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-02 |
| 2019-07-14 - 03 | Asuta | 2019-07-14 03:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-03 |
| 2019-07-14 - 04 | Asuta | 2019-07-14 04:00:00+03:00 | 0.000000 | 0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-04 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-06-20 - 20 | Sheraton | 2021-06-20 20:00:00+03:00 | 0.216667 | 0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-20 |
| 2021-06-20 - 21 | Sheraton | 2021-06-20 21:00:00+03:00 | 0.275000 | 0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-21 |
| 2021-06-20 - 22 | Sheraton | 2021-06-20 22:00:00+03:00 | 0.158333 | 0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-22 |
| 2021-06-20 - 23 | Sheraton | 2021-06-20 23:00:00+03:00 | 0.716667 | 1 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-23 |
| 2021-06-21 - 00 | Sheraton | 2021-06-21 00:00:00+03:00 | 1.000000 | 1 | 2021-06-21 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Monday-00 |
92268 rows × 7 columns
Now that we have week information for each row, we can start showing weekly trends.
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Parking Per Hour In The Week', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
ax=ax,
x='day_hour',
y='class',
data=hourly_df
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
ax = axes[i + 1]
g = sns.lineplot(
ax=ax,
x='day_hour',
y='class',
data=hourly_df[hourly_df['lot'] == lot]
)
g.axhline(0.5, color='red')
ax.set_title(lot)
for ax in axes:
ax.set_ylim(bottom=0, top=1)
for i, label in enumerate(ax.get_xticklabels()):
if i % 8 == 0:
label.set_rotation(90)
else:
label.set_visible(False)
Now the visualization is much better and it actually starts to make sense!
We finally have our first informative visualization, that supports our hypothesis of weekly trends.
Note the width of the lines: the wider the line, the more variation there is in the classes at that point. We can see that all lines are relatively narrow, especially compared with the earlier, unsuccessful visualization.
The narrow lines mean that for every hour of the week, the classes are quite similar across all weeks in our data.
Even though it's a pretty good start, the visualization does not prove that the data is completely predictable using the lot and day_hour features only; it only shows a strong correlation.
Now let's take a look from another point of view.
Instead of showing the average parking per hour in the week, among all weeks, let's show the average parking per week over time.
To do so, we need to have the average parking status per week over time.
We define the parking status of a week as the average of all hours within the week.
It's important to note that the class will no longer be one of 2 discrete options - it becomes a continuous number. We will only use this form of the data for this visualization and then return to the earlier per-hour format.
weekly_df = (
hourly_df
.groupby(by=['week_first_day', 'lot'])
.agg({'class': 'mean'})
.reset_index()
)
weekly_df
| week_first_day | lot | class | |
|---|---|---|---|
| 0 | 2019-07-14 00:00:00+03:00 | Asuta | 0.595238 |
| 1 | 2019-07-14 00:00:00+03:00 | Basel | 0.547619 |
| 2 | 2019-07-14 00:00:00+03:00 | Dan | 0.494048 |
| 3 | 2019-07-14 00:00:00+03:00 | Dubnov | 0.840764 |
| 4 | 2019-07-14 00:00:00+03:00 | Habima | 0.878981 |
| ... | ... | ... | ... |
| 571 | 2021-06-20 00:00:00+03:00 | Basel | 0.480000 |
| 572 | 2021-06-20 00:00:00+03:00 | Dan | 0.600000 |
| 573 | 2021-06-20 00:00:00+03:00 | Dubnov | 1.000000 |
| 574 | 2021-06-20 00:00:00+03:00 | Habima | 0.880000 |
| 575 | 2021-06-20 00:00:00+03:00 | Sheraton | 0.560000 |
576 rows × 3 columns
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Parking Per Week Over Time', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
ax=ax,
x='week_first_day',
marker='o',
y='class',
data=weekly_df
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
ax = axes[i + 1]
sns.lineplot(
ax=ax,
x='week_first_day',
marker='o',
y='class',
data=weekly_df[weekly_df['lot'] == lot]
)
ax.set_title(lot)
for ax in axes:
ax.set_ylim(bottom=0, top=1)
for i, label in enumerate(ax.get_xticklabels()):
label.set_rotation(90)
As we can see, there have been some changes over time.
There are several specific things to take from this view.
The first is that there is a big gap with no data points somewhere between 2020-05 and 2020-07. Moreover, the first week after the gap is extremely different from usual - mostly for the worse, but not only - and it differs a lot among the lots, as shown by the width of the top line.
The second is the variation around 2019-11, also shown by the width of the top line: some parking lots had much more space than usual, and some much less.
The third is the two peaks around 2020-04 and 2020-10, in which all lots had an unusually large number of free parking spots.
We will try to use these insights later in this project, and will also try to explain them.
Finally, we had an idea for a visualization that combines most of the data and presents it in an intuitive way.
The idea is to show every week of a specific parking lot as a row of points, each colored as before (green means Free, red means Full). To show trends over time, we stack the weeks, so that the lowest row is the earliest week:
fig, axes = plt.subplots(len(LOT_NAMES), 1, figsize=(15, 8 * len(LOT_NAMES)), sharey=True)
fig.suptitle('Parking Per Hour In The Week Over Time', fontsize=22, y=0.895)
plt.subplots_adjust(hspace=0.4)
for i, lot in enumerate(LOT_NAMES):
ax = axes[i]
plot = sns.scatterplot(
ax=ax,
palette=GREEN_RED_PALETTE,
legend=False,
x='day_hour',
y='week_first_day',
hue='class',
s=10,
data=hourly_df[hourly_df['lot'] == lot]
)
ax.set_title(lot)
for ax in axes:
for i, label in enumerate(ax.get_xticklabels()):
if i % 4 == 0:
label.set_rotation(90)
else:
label.set_visible(False)
Perfect! We can see most of the insights from both earlier visualizations here.
The weekly trends appear as vertical patterns (green/red "columns"), and changes over time appear as horizontal patterns - for example, entirely green rows, i.e., whole weeks with only Free statuses.
Before we begin, we need to choose the features we want, and handle their types.
Currently, the features we have look like this:
clean_hourly_df = hourly_df.copy().reset_index()[['lot', 'datetime', 'class']]
clean_hourly_df
| lot | datetime | class | |
|---|---|---|---|
| 0 | Asuta | 2019-07-14 00:00:00+03:00 | 0 |
| 1 | Asuta | 2019-07-14 01:00:00+03:00 | 0 |
| 2 | Asuta | 2019-07-14 02:00:00+03:00 | 0 |
| 3 | Asuta | 2019-07-14 03:00:00+03:00 | 0 |
| 4 | Asuta | 2019-07-14 04:00:00+03:00 | 0 |
| ... | ... | ... | ... |
| 92263 | Sheraton | 2021-06-20 20:00:00+03:00 | 0 |
| 92264 | Sheraton | 2021-06-20 21:00:00+03:00 | 0 |
| 92265 | Sheraton | 2021-06-20 22:00:00+03:00 | 0 |
| 92266 | Sheraton | 2021-06-20 23:00:00+03:00 | 1 |
| 92267 | Sheraton | 2021-06-21 00:00:00+03:00 | 1 |
92268 rows × 3 columns
If we want the models to use the information encoded in datetime, we need to create columns describing the day of the week and the time of day.
For day, the simplest option is a one-hot representation, with a column for every day of the week (using pd.get_dummies).
For hour, the simplest option is simply the numeric representation between 0-23.
normalized_df = clean_hourly_df.copy()
normalized_df['day'] = normalized_df['datetime'].dt.day_name()
normalized_df['hour'] = normalized_df['datetime'].dt.hour
normalized_df = normalized_df.set_index(['lot', 'datetime'])
normalized_df = pd.get_dummies(normalized_df, prefix='', prefix_sep='')
normalized_df = normalized_df.reset_index()
normalized_df
| lot | datetime | class | hour | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asuta | 2019-07-14 00:00:00+03:00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | Asuta | 2019-07-14 01:00:00+03:00 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | Asuta | 2019-07-14 02:00:00+03:00 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | Asuta | 2019-07-14 03:00:00+03:00 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Asuta | 2019-07-14 04:00:00+03:00 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92263 | Sheraton | 2021-06-20 20:00:00+03:00 | 0 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 92264 | Sheraton | 2021-06-20 21:00:00+03:00 | 0 | 21 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 92265 | Sheraton | 2021-06-20 22:00:00+03:00 | 0 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 92266 | Sheraton | 2021-06-20 23:00:00+03:00 | 1 | 23 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 92267 | Sheraton | 2021-06-21 00:00:00+03:00 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
92268 rows × 11 columns
When handling a classification task, it is important to split the train and test data wisely.
Usually, in non-time-based problems, the best practice is to split the data randomly.
However, in our case we want to simulate reality, in which we have prior data and want to predict the future.
For this reason, we start by splitting the data so that the train set is all data from before 2021, and the test set is all data from 2021 onwards.
In addition, we split our data by parking lot name, so we can train a separate model for each lot.
TRAIN_TEST_SPLIT_DATE = pd.to_datetime(date(year=2021, month=1, day=1)).tz_localize('Asia/Jerusalem')
dataset_per_lot = {}
for lot_name, lot_df in normalized_df.groupby('lot'):
train_data = lot_df[lot_df['datetime'] < TRAIN_TEST_SPLIT_DATE].drop(columns=['lot', 'datetime'])
test_data = lot_df[lot_df['datetime'] >= TRAIN_TEST_SPLIT_DATE].drop(columns=['lot', 'datetime'])
dataset_per_lot[lot_name] = (train_data, test_data)
train_values = sum([len(x[0]) for x in dataset_per_lot.values()])
test_values = sum([len(x[1]) for x in dataset_per_lot.values()])
print(f'Train data has {train_values} values')
print(f'Test data has {test_values} values')
print(f'Test data is {100 * test_values / train_values:.2f}% of the size of the train data')
Train data has 67650 values Test data has 24618 values Test data is 36.39% of the size of the train data
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
def split_df_to_x_y(df):
df = df.copy()
return df, df.pop('class')
scores = list()
for lot, (train, test) in dataset_per_lot.items():
train_x, train_y = split_df_to_x_y(train)
test_x, test_y = split_df_to_x_y(test)
clf = LogisticRegression(random_state=RANDOM_STATE, max_iter=1e9)
clf.fit(train_x, train_y)
score = accuracy_score(y_true=test_y, y_pred=clf.predict(test_x))
scores.append({'lot':lot, 'score':score})
scores_df = pd.DataFrame.from_dict(scores).set_index('lot')
scores_df.loc['Average'] = scores_df['score'].mean()
print(scores_df.to_string())
score lot Asuta 0.520351 Basel 0.503047 Dan 0.721423 Dubnov 0.998781 Habima 0.918840 Sheraton 0.620034 Average 0.713746
The accuracy results of this basic logistic regression are very diverse: the accuracy for Basel is no better than random (and worse than always choosing the most common class), while the accuracy for Dubnov is far higher than it is fair to expect.
When you think about it, there are a few fairly clear reasons for this.
The first and main reason is that we chose a single point in time and measured ourselves according to this one split. That point in time carries far too much weight - had we chosen a different one, we would get completely different results.
The second reason is that we used only about 73% of the data for training, which might not have been enough. We should train on more data and test on less.
The third is related to the first and is specific to Dubnov. The Dubnov lot was almost always free throughout 2021 so far, so a simple classifier that always predicts Free would achieve nearly 100% accuracy. This was not true before 2021, though.
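To make that "always predicts Free" comparison concrete, here is a sketch of such a majority-class baseline using sklearn's DummyClassifier, on toy data we made up to mimic a heavily imbalanced lot (not our actual data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy imbalanced data: ~99% of hours are Free (class 1), like Dubnov in 2021
X_train = rng.random((1000, 8))
y_train = (rng.random(1000) < 0.99).astype(int)
X_test = rng.random((200, 8))
y_test = (rng.random(200) < 0.99).astype(int)

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
acc = accuracy_score(y_test, baseline.predict(X_test))
print(f'{acc:.3f}')  # close to 0.99 by construction
```

Any real model should be judged against this baseline per lot, not against a flat 50%.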
To avoid giving too much of a meaning to one chosen splitting point, we will use a technique similar to the classic k-folds cross validation.
We will leave some of the data aside as the test set, and use all other data for training and validating the results.
The train-validation data will then be used in 10 different splits, each fold consisting of a 12-month training period and a 1-month validation period made up of the month immediately following the training year.
Each model will be trained and evaluated 10 times (independently), and its score will be the average accuracy over all 10 folds.
After comparing different models and different parameters, and choosing the best one according to its scores over the train-validation data, we will train the chosen model on all data before the test set starts, and evaluate its results on the test set.
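The fold boundaries described above reduce to plain date arithmetic; here is a minimal sketch (the helper name is ours, for illustration):

```python
from datetime import date

def fold_boundaries(month, year):
    """For a given validation month, return (train_start, val_start, val_end).

    The training window is the 12 months immediately preceding the
    validation month; val_end is exclusive."""
    train_start = date(year - 1, month, 1)
    val_start = date(year, month, 1)
    # First day of the following month, handling the December rollover.
    val_end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return train_start, val_start, val_end

print(fold_boundaries(12, 2020))
# (datetime.date(2019, 12, 1), datetime.date(2020, 12, 1), datetime.date(2021, 1, 1))
```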
MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION = [
(7, 2020), (8, 2020), (9, 2020), (10, 2020), (11, 2020), (12, 2020), (1, 2021), (2, 2021), (3, 2021), (4, 2021)
]
TEST_SPLIT_MONTH_AND_YEAR = (5, 2021)
def create_train_validation_test_datasets(df, print_results=False):
train_and_validation_datasets_per_lot = []
for month, year in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION:
current_dataset_per_lot = {}
for lot_name, lot_df in df.groupby('lot'):
starting_train_date = pd.to_datetime(date(year=year - 1, month=month, day=1)).tz_localize('Asia/Jerusalem')
starting_validation_date = pd.to_datetime(date(year=year, month=month, day=1)).tz_localize('Asia/Jerusalem')
if month == 12:
ending_validation_date = pd.to_datetime(date(year=year + 1, month=1, day=1)).tz_localize('Asia/Jerusalem')
else:
ending_validation_date = pd.to_datetime(date(year=year, month=month + 1, day=1)).tz_localize('Asia/Jerusalem')
train_data = lot_df[(lot_df['datetime'] >= starting_train_date) & (lot_df['datetime'] < starting_validation_date)]
validation_data = lot_df[(lot_df['datetime'] >= starting_validation_date) & (lot_df['datetime'] < ending_validation_date)]
current_dataset_per_lot[lot_name] = (
train_data.drop(columns=['lot', 'datetime']), validation_data.drop(columns=['lot', 'datetime'])
)
train_and_validation_datasets_per_lot.append(current_dataset_per_lot)
test_dataset_per_lot = {}
for lot_name, lot_df in df.groupby('lot'):
month, year = TEST_SPLIT_MONTH_AND_YEAR
starting_test_date = pd.to_datetime(date(year=year, month=month, day=1)).tz_localize('Asia/Jerusalem')
ending_test_date = pd.to_datetime(date(year=year, month=month + 1, day=1)).tz_localize('Asia/Jerusalem')
train_data = lot_df[lot_df['datetime'] < starting_test_date]
test_data = lot_df[(lot_df['datetime'] >= starting_test_date) & (lot_df['datetime'] < ending_test_date)]
test_dataset_per_lot[lot_name] = (
train_data.drop(columns=['lot', 'datetime']), test_data.drop(columns=['lot', 'datetime'])
)
if print_results:
print(f'Folds in train-validation data: {len(train_and_validation_datasets_per_lot)}')
print(f'Shape of example train data for validation: {train_and_validation_datasets_per_lot[0]["Basel"][0].shape}')
print(f'Shape of example validation data: {train_and_validation_datasets_per_lot[0]["Basel"][1].shape}')
print(f'\nShape of example train data for test data: {test_dataset_per_lot["Basel"][0].shape}')
print(f'Shape of example test data: {test_dataset_per_lot["Basel"][1].shape}')
return train_and_validation_datasets_per_lot, test_dataset_per_lot
def _evaluate_model_once(model_training_func, model_evaluating_func, dataset_per_lot, use_test_as_val=False):
scores = {}
for lot, (train, test) in dataset_per_lot.items():
train_x, train_y = split_df_to_x_y(train)
test_x, test_y = split_df_to_x_y(test)
if use_test_as_val:
model = model_training_func(train_x, train_y, lot, val_x=test_x, val_y=test_y)
else:
model = model_training_func(train_x, train_y, lot)
score = model_evaluating_func(model, test_x, test_y)
scores[lot] = score
return scores
def evaluate_model(model_training_func, model_evaluating_func, datasets_per_lot,
print_results=True, detailed=False, row_names=None, use_test_as_val=False):
score_dicts = []
if print_results:
datasets_per_lot = tqdm(datasets_per_lot)
for dataset_per_lot in datasets_per_lot:
score_dicts.append(_evaluate_model_once(model_training_func, model_evaluating_func, dataset_per_lot, use_test_as_val=use_test_as_val))
lot_to_scores = {lot: [score_dict[lot] for score_dict in score_dicts] for lot in LOT_NAMES}
scores_df = pd.DataFrame.from_dict(lot_to_scores)
if row_names:
scores_df.index = row_names
if print_results and detailed:
scores_df['Average'] = scores_df.mean(axis=1)
scores_df.loc['Average'] = scores_df.mean()
plt.figure(figsize=(5,5))
sns.heatmap(scores_df, annot=True)
plt.show()
scores_df = scores_df.mean().T
scores_df.loc['Average'] = scores_df.mean()
if print_results:
print(scores_df.to_string())
return scores_df.to_dict()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(normalized_df,
print_results=True)
Folds in train-validation data: 10
Shape of example train data for validation: (6946, 9)
Shape of example validation data: (684, 9)

Shape of example train data for test data: (14161, 9)
Shape of example test data: (743, 9)
As planned, each fold has roughly ten times more data in its train set than in its validation set (a 12-month window versus a 1-month one), and the final test set contains a similar amount of data to a typical validation set.
Now let's run the same LogisticRegression model and see its average validation score.
results_df = pd.DataFrame(columns=['average score'])
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: (LogisticRegression(random_state=RANDOM_STATE, max_iter=1e9)
.fit(train_x, train_y)),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['initial logistic regression'] = score['Average']
results_df
Basel      0.582491
Asuta      0.586554
Sheraton   0.657385
Dan        0.842706
Dubnov     0.923519
Habima     0.941570
Average    0.755704
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
Now the results make much more sense!
For example, Basel's score is well above 50%, and Dubnov's is no longer 99.9%.
Now that we have a good way to measure the models, let's try DecisionTree and RandomForest and compare them to the initial LogisticRegression.
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: (DecisionTreeClassifier(max_depth=4, random_state=RANDOM_STATE)
.fit(train_x, train_y)),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['initial decision tree'] = score['Average']
results_df
Basel      0.737645
Asuta      0.780867
Sheraton   0.687506
Dan        0.796338
Dubnov     0.901748
Habima     0.937744
Average    0.806975
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(*split_df_to_x_y(train_and_validation_datasets_per_lot[0]['Basel'][0]))
plt.figure(figsize=(15,5))
_ = tree.plot_tree(
clf,
feature_names=[col for col in train_and_validation_datasets_per_lot[0]['Basel'][0].columns if col != 'class'],
proportion=True,
impurity=True,
filled=True,
fontsize=10,
class_names=['Full', 'Free']
)
plt.show()
This graph describes the decision process of the example Decision Tree model.
Each node holds a classification criterion, such as the hour or the day of the week. The root of the tree is the starting point, and we follow the rules down through the nodes: left when the criterion is met, right when it is not.
For example, say we want to predict the status on a Friday evening (6pm). The first node splits on hour <= 17.5; since our hour (18) does not satisfy this criterion, we take the right path. The next criterion is Friday <= 0.5 ("not Friday"); our day is Friday, so we go right again. We finally reach a leaf that tells us the predicted class is Free.
Each node also carries information about how certain its decision is. The Gini score measures the node's impurity: the probability of mislabeling a random sample from the node if it were labeled according to the node's class distribution. A Gini score of 0 means every training sample reaching the node belongs to a single class. Samples is the percentage of the training data that reached the node, and value describes how that data is distributed among the classes.
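The Gini score shown in each node can be reproduced by hand; a small sketch (the class counts here are made up for illustration):

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k.

    0 means the node is pure (a single class); 0.5 is the maximum
    for two balanced classes."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([100, 0]))  # pure node -> 0.0
print(gini([50, 50]))  # perfectly mixed binary node -> 0.5
print(gini([80, 20]))  # mostly one class -> roughly 0.32
```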
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: (RandomForestClassifier(max_depth=4, n_estimators=10, random_state=RANDOM_STATE)
.fit(train_x, train_y)),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['initial random forest'] = score['Average']
results_df
Basel      0.698937
Asuta      0.739620
Sheraton   0.705568
Dan        0.822049
Dubnov     0.943618
Habima     0.940985
Average    0.808463
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
Let's now take it another step forward, and compare more models with more parameter combinations to find the best one for each lot:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
all_clf_confs = [
(LogisticRegression, dict(max_iter=1e9, random_state=RANDOM_STATE)),
(DecisionTreeClassifier, dict(max_depth=4, random_state=RANDOM_STATE)),
(DecisionTreeClassifier, dict(max_depth=6, random_state=RANDOM_STATE)),
(DecisionTreeClassifier, dict(max_depth=8, random_state=RANDOM_STATE)),
(RandomForestClassifier, dict(max_depth=4, n_estimators=10, random_state=RANDOM_STATE)),
(RandomForestClassifier, dict(max_depth=6, n_estimators=10, random_state=RANDOM_STATE)),
(RandomForestClassifier, dict(max_depth=8, n_estimators=10, random_state=RANDOM_STATE)),
(KNeighborsClassifier, dict(n_neighbors=3)),
(KNeighborsClassifier, dict(n_neighbors=10)),
(KNeighborsClassifier, dict(n_neighbors=100)),
(AdaBoostClassifier, dict(random_state=RANDOM_STATE)),
(GaussianNB, dict()),
]
def find_lot_to_best_clf(clf_confs):
lot_to_max_score = {lot: -1 for lot in LOT_NAMES}
lot_to_best_clf = {lot: None for lot in LOT_NAMES}
for clf_class, clf_kwargs in tqdm(clf_confs, desc='Finding best classifier for each lot'):
lot_to_score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: clf_class(**clf_kwargs).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
print_results=False
)
for lot in LOT_NAMES:
if lot_to_score[lot] > lot_to_max_score[lot]:
lot_to_max_score[lot] = lot_to_score[lot]
lot_to_best_clf[lot] = (clf_class, clf_kwargs)
return lot_to_best_clf
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
for lot, (cls, kwargs) in lot_to_best_clf.items():
params = {k: v for k, v in kwargs.items() if k != 'random_state'}
print(f'{lot}:{" " * (15 - len(lot))}{cls.__name__}{" " * (30 - len(cls.__name__))}(params: {params})')
Basel: DecisionTreeClassifier (params: {'max_depth': 8})
Asuta: DecisionTreeClassifier (params: {'max_depth': 8})
Sheraton: RandomForestClassifier (params: {'max_depth': 8, 'n_estimators': 10})
Dan: LogisticRegression (params: {'max_iter': 1000000000.0})
Dubnov: RandomForestClassifier (params: {'max_depth': 4, 'n_estimators': 10})
Habima: LogisticRegression (params: {'max_iter': 1000000000.0})
As the results show, there is no single best classifier - each lot has its own best configuration.
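The per-lot selection in find_lot_to_best_clf is a simple argmax over configurations; a toy sketch of the same logic (configuration names and scores are invented):

```python
# Hypothetical validation scores for two configurations over two lots.
scores = {
    'confA': {'Basel': 0.74, 'Dan': 0.81},
    'confB': {'Basel': 0.71, 'Dan': 0.84},
}

# For each lot, keep the configuration with the highest validation score.
lots = ['Basel', 'Dan']
best = {lot: max(scores, key=lambda conf: scores[conf][lot]) for lot in lots}
print(best)  # {'Basel': 'confA', 'Dan': 'confB'}
```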
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['initial best classifier'] = score['Average']
results_df
Basel      0.750294
Asuta      0.792074
Sheraton   0.725006
Dan        0.842706
Dubnov     0.943618
Habima     0.941570
Average    0.832545
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
This score is impressive compared to the initial logistic regression we started with: by changing nothing but the models and their parameters, the accuracy improved by almost 8 percentage points!
The first improvement we want to try is to change the representation of the hour and the day: from a numeric 0-23 value and a one-hot encoding to a continuous representation in which the difference between hours 23 and 1 is the same as the difference between hours 1 and 3, and likewise for the days of the week.
To do so, we will compute a sine and a cosine over the day and the week; instead of storing the hour or the day directly, we store a point on the wave that carries the same information.
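A quick numeric check of the cyclic property (a standalone sketch; the real features below are computed from timestamps, while here we map hours straight onto the unit circle):

```python
import numpy as np

def hour_to_point(hour):
    """Map an hour of day (0-23) onto the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# With the raw 0-23 encoding, |23 - 1| = 22 while |1 - 3| = 2.
# On the circle, both pairs are exactly two hours apart:
d_23_1 = np.linalg.norm(hour_to_point(23) - hour_to_point(1))
d_1_3 = np.linalg.norm(hour_to_point(1) - hour_to_point(3))
print(np.isclose(d_23_1, d_1_3))  # True
```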
SECONDS_IN_DAY = 60 * 60 * 24
SECONDS_IN_WEEK = SECONDS_IN_DAY * 7
sin_normalized_df = clean_hourly_df.copy()
ts = sin_normalized_df['datetime'].view('int64') // 10**9
sin_normalized_df['day sin'] = np.sin(ts * (2 * np.pi / SECONDS_IN_DAY))
sin_normalized_df['day cos'] = np.cos(ts * (2 * np.pi / SECONDS_IN_DAY))
sin_normalized_df['week sin'] = np.sin(ts * (2 * np.pi / SECONDS_IN_WEEK))
sin_normalized_df['week cos'] = np.cos(ts * (2 * np.pi / SECONDS_IN_WEEK))
display(sin_normalized_df)
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
| lot | datetime | class | day sin | day cos | week sin | week cos | |
|---|---|---|---|---|---|---|---|
| 0 | Asuta | 2019-07-14 00:00:00+03:00 | 0 | -7.071068e-01 | 7.071068e-01 | 0.532032 | -0.846724 |
| 1 | Asuta | 2019-07-14 01:00:00+03:00 | 0 | -5.000000e-01 | 8.660254e-01 | 0.500000 | -0.866025 |
| 2 | Asuta | 2019-07-14 02:00:00+03:00 | 0 | -2.588190e-01 | 9.659258e-01 | 0.467269 | -0.884115 |
| 3 | Asuta | 2019-07-14 03:00:00+03:00 | 0 | -5.727700e-13 | 1.000000e+00 | 0.433884 | -0.900969 |
| 4 | Asuta | 2019-07-14 04:00:00+03:00 | 0 | 2.588190e-01 | 9.659258e-01 | 0.399892 | -0.916562 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 92263 | Sheraton | 2021-06-20 20:00:00+03:00 | 0 | -9.659258e-01 | -2.588190e-01 | -0.185912 | -0.982566 |
| 92264 | Sheraton | 2021-06-20 21:00:00+03:00 | 0 | -1.000000e+00 | -1.054628e-11 | -0.222521 | -0.974928 |
| 92265 | Sheraton | 2021-06-20 22:00:00+03:00 | 0 | -9.659258e-01 | 2.588190e-01 | -0.258819 | -0.965926 |
| 92266 | Sheraton | 2021-06-20 23:00:00+03:00 | 1 | -8.660254e-01 | 5.000000e-01 | -0.294755 | -0.955573 |
| 92267 | Sheraton | 2021-06-21 00:00:00+03:00 | 1 | -7.071068e-01 | 7.071068e-01 | -0.330279 | -0.943883 |
92268 rows × 7 columns
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['advanced time norm best classifier'] = score['Average']
results_df
Basel      0.746726
Asuta      0.795590
Sheraton   0.726342
Dan        0.810053
Dubnov     0.923034
Habima     0.941570
Average    0.823886
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
| advanced time norm best classifier | 0.823886 |
The results are surprising - the original representation of the days and hours actually led to higher accuracy, so we will go back to it for now.
In order to enhance our results, we want to add as much relevant information as possible to the dataset.
We will use the following data: hourly weather measurements, daily sunrise and sunset times, and a list of Israeli holidays.
All of this data was gathered from the internet using search and crawling methods that won't be detailed here, and saved to separate CSV files.
We are going to use each of the files separately, find its impact, and in the end, merge everything together.
initial_weather_df = pd.read_csv('weather.csv')
initial_weather_df.head(3)
| Name | Date time | Maximum Temperature | Minimum Temperature | Temperature | Wind Chill | Heat Index | Precipitation | Snow | Snow Depth | Wind Speed | Wind Direction | Wind Gust | Visibility | Cloud Cover | Relative Humidity | Conditions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | תל אביב - יפו, תל אביב, ישראל | 07/01/2019 00:00:00 | 26.9 | 26.9 | 26.9 | NaN | 29.4 | 0.0 | 0.0 | 0.0 | 5.7 | 4.0 | NaN | 10.0 | 48.2 | 78.65 | Partially cloudy |
| 1 | תל אביב - יפו, תל אביב, ישראל | 07/01/2019 01:00:00 | 25.6 | 25.6 | 25.6 | NaN | NaN | 0.0 | 0.0 | 0.0 | 3.4 | 17.0 | NaN | 10.0 | 19.5 | 79.49 | Clear |
| 2 | תל אביב - יפו, תל אביב, ישראל | 07/01/2019 02:00:00 | 26.0 | 26.0 | 26.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 5.4 | NaN | NaN | 10.0 | 27.8 | 78.64 | Partially cloudy |
initial_weather_df['Conditions'].value_counts()
Partially cloudy          8092
Clear                     7793
Overcast                   705
Rain, Partially cloudy     145
Rain, Overcast              55
Rain                        22
Name: Conditions, dtype: int64
The weather dataset includes a lot of information, most of which seems irrelevant or duplicated. Our gut feeling was that rain would have a high impact on parking occupancy, and perhaps temperature as well, since people change their habits in different weather conditions - for example, going to the beach (and parking near it) more often when the weather is hotter.
To explore the data, let's first merge it with our main dataframe and look for correlations with class column:
initial_weather_df['hour_datetime'] = (
initial_weather_df['Date time']
.apply(lambda d: datetime.strptime(d, '%m/%d/%Y %H:%M:%S'))
.dt.tz_localize('Asia/Jerusalem', 'infer')
)
initial_weather_df['Conditions'] = initial_weather_df['Conditions'].map({
'Partially cloudy': 'Partially cloudy',
'Clear': 'Clear',
'Rain, Partially cloudy': 'Rain',
'Rain, Overcast': 'Rain',
'Rain': 'Rain',
})
weather_research_df = normalized_df.copy()
weather_research_df['hour_datetime'] = weather_research_df['datetime'].copy().dt.floor('H', ambiguous=True)
weather_research_df = (
weather_research_df
.merge(initial_weather_df, on='hour_datetime', how='left')
.drop(columns=['hour_datetime'])
)
weather_research_df = pd.get_dummies(weather_research_df, columns=['Conditions'])
weather_corr_df = pd.DataFrame(weather_research_df.corr()['class'].sort_values(key=abs, ascending=False))
weather_corr_df = weather_corr_df.reset_index()
weather_corr_df = weather_corr_df.rename(columns={'index': 'column', 'class': 'corr with class'})
weather_corr_df['non none ratio'] = weather_corr_df['column'].apply(lambda col: (
len(weather_research_df[~weather_research_df[col].isna()]) / len(weather_research_df)
))
with pd.option_context("display.max_rows", 50):
display(weather_corr_df)
| column | corr with class | non none ratio | |
|---|---|---|---|
| 0 | class | 1.000000 | 1.000000 |
| 1 | hour | -0.118215 | 1.000000 |
| 2 | Heat Index | 0.049283 | 0.233407 |
| 3 | Wind Gust | -0.048310 | 0.005722 |
| 4 | Cloud Cover | 0.042939 | 0.963844 |
| 5 | Wind Chill | 0.032296 | 0.018338 |
| 6 | Conditions_Rain | 0.028559 | 1.000000 |
| 7 | Friday | 0.022682 | 1.000000 |
| 8 | Conditions_Clear | -0.020303 | 1.000000 |
| 9 | Thursday | -0.017629 | 1.000000 |
| 10 | Wind Direction | -0.016937 | 0.894828 |
| 11 | Precipitation | 0.015884 | 0.968722 |
| 12 | Tuesday | -0.015695 | 1.000000 |
| 13 | Conditions_Partially cloudy | 0.015623 | 1.000000 |
| 14 | Saturday | 0.013945 | 1.000000 |
| 15 | Wind Speed | 0.012835 | 0.967681 |
| 16 | Sunday | -0.008499 | 1.000000 |
| 17 | Wednesday | 0.004536 | 1.000000 |
| 18 | Visibility | 0.004047 | 0.968331 |
| 19 | Relative Humidity | 0.003132 | 0.968396 |
| 20 | Minimum Temperature | -0.003031 | 0.968396 |
| 21 | Temperature | -0.003031 | 0.968396 |
| 22 | Maximum Temperature | -0.003031 | 0.968396 |
| 23 | Monday | 0.000687 | 1.000000 |
| 24 | Snow | NaN | 0.968396 |
| 25 | Snow Depth | NaN | 0.968396 |
The results are surprising! Heat Index and Cloud Cover are more strongly correlated with class than Conditions_Rain!
The next step is to actually use the data to enhance the classification. To do that, we will select only the features that have a decent correlation, a decent existence ratio, and that make at least some sense to us (as opposed to Wind Direction, which simply does not).
Therefore we will keep only the features Heat Index, Cloud Cover and Conditions.
Temperature has such a low correlation that it will not be included!
initial_weather_df[['Heat Index', 'Cloud Cover']].describe()
| Heat Index | Cloud Cover | |
|---|---|---|
| count | 4223.000000 | 16740.000000 |
| mean | 31.493275 | 27.788250 |
| std | 2.922584 | 25.238947 |
| min | 25.900000 | 0.000000 |
| 25% | 29.200000 | 0.000000 |
| 50% | 31.200000 | 27.800000 |
| 75% | 33.300000 | 47.900000 |
| max | 45.300000 | 100.000000 |
To handle missing values, we will fill each column's NaNs with that column's minimum value.
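In isolation, filling per-column minima with fillna looks like this (toy values, not our weather data):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both weather columns (hypothetical values).
df = pd.DataFrame({
    'Heat Index': [31.2, np.nan, 29.5, np.nan],
    'Cloud Cover': [np.nan, 40.0, 0.0, 80.0],
})

# Passing df.min() (a per-column Series) to fillna fills each column's
# missing entries with that column's own minimum.
filled = df.fillna(df.min())
print(filled['Heat Index'].tolist())   # [31.2, 29.5, 29.5, 29.5]
print(filled['Cloud Cover'].tolist())  # [0.0, 40.0, 0.0, 80.0]
```

One could equally fill with the mean; the minimum acts here as a neutral "nothing notable" value for sparse columns like Heat Index.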
weather_df = normalized_df.copy()
weather_df['hour_datetime'] = weather_df['datetime'].copy().dt.floor('H', ambiguous=True)
weather_df = (
weather_df
.merge(initial_weather_df[['hour_datetime', 'Heat Index', 'Cloud Cover', 'Conditions']], on='hour_datetime', how='left')
.drop(columns=['hour_datetime'])
)
weather_df['Heat Index'] = weather_df['Heat Index'].fillna(weather_df['Heat Index'].min())
weather_df['Cloud Cover'] = weather_df['Cloud Cover'].fillna(weather_df['Cloud Cover'].min())
weather_df = pd.get_dummies(weather_df, columns=['Conditions'], prefix='', prefix_sep='')
display(weather_df)
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(weather_df)
| lot | datetime | class | hour | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | Heat Index | Cloud Cover | Clear | Partially cloudy | Rain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asuta | 2019-07-14 00:00:00+03:00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 50.0 | 0 | 1 | 0 |
| 1 | Asuta | 2019-07-14 01:00:00+03:00 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 50.0 | 0 | 1 | 0 |
| 2 | Asuta | 2019-07-14 02:00:00+03:00 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 50.0 | 0 | 1 | 0 |
| 3 | Asuta | 2019-07-14 03:00:00+03:00 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 44.6 | 0 | 1 | 0 |
| 4 | Asuta | 2019-07-14 04:00:00+03:00 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 33.0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92263 | Sheraton | 2021-06-20 20:00:00+03:00 | 0 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92264 | Sheraton | 2021-06-20 21:00:00+03:00 | 0 | 21 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92265 | Sheraton | 2021-06-20 22:00:00+03:00 | 0 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92266 | Sheraton | 2021-06-20 23:00:00+03:00 | 1 | 23 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92267 | Sheraton | 2021-06-21 00:00:00+03:00 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 25.9 | 0.0 | 0 | 0 | 0 |
92268 rows × 16 columns
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot
)
results_df.loc['weather best classifier'] = score['Average']
results_df
Basel      0.737107
Asuta      0.791744
Sheraton   0.766462
Dan        0.837714
Dubnov     0.946121
Habima     0.941570
Average    0.836786
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
| advanced time norm best classifier | 0.823886 |
| weather best classifier | 0.836786 |
There we go! The accuracy improved by about 0.4 percentage points thanks to the new data.
initial_sun_hours_df = pd.read_csv('sun_hours.csv')
initial_sun_hours_df
| date | sunrise | sunset | |
|---|---|---|---|
| 0 | 2019-07-01 | 05:38 | 19:51 |
| 1 | 2019-07-02 | 05:38 | 19:51 |
| 2 | 2019-07-03 | 05:38 | 19:51 |
| 3 | 2019-07-04 | 05:39 | 19:50 |
| 4 | 2019-07-05 | 05:39 | 19:50 |
| ... | ... | ... | ... |
| 726 | 2021-06-26 | 05:36 | 19:51 |
| 727 | 2021-06-27 | 05:36 | 19:51 |
| 728 | 2021-06-28 | 05:37 | 19:51 |
| 729 | 2021-06-29 | 05:37 | 19:51 |
| 730 | 2021-06-30 | 05:37 | 19:51 |
731 rows × 3 columns
The day / night dataset lists the exact sunrise and sunset minute for each day.
Our initial expectation was that it would correlate strongly with our class, because people change their habits according to the daylight hours, especially in the evening (i.e. we expected sunset to be more informative than sunrise).
To be able to use the data, let's normalize it, merge with our main dataframe and check its correlation with class:
initial_sun_hours_df['sunrise'] = initial_sun_hours_df['sunrise'].str.split(':').apply(lambda tup: int(tup[0]) * 60 + int(tup[1]))
initial_sun_hours_df['sunset'] = initial_sun_hours_df['sunset'].str.split(':').apply(lambda tup: int(tup[0]) * 60 + int(tup[1]))
initial_sun_hours_df['sunrise'] = (initial_sun_hours_df['sunrise'] - initial_sun_hours_df['sunrise'].mean()) / (initial_sun_hours_df['sunrise'].max() - initial_sun_hours_df['sunrise'].min())
initial_sun_hours_df['sunset'] = (initial_sun_hours_df['sunset'] - initial_sun_hours_df['sunset'].mean()) / (initial_sun_hours_df['sunset'].max() - initial_sun_hours_df['sunset'].min())
initial_sun_hours_df['date'] = initial_sun_hours_df['date'].astype('datetime64[ns]')
initial_sun_hours_df
| date | sunrise | sunset | |
|---|---|---|---|
| 0 | 2019-07-01 | -0.176260 | 0.469884 |
| 1 | 2019-07-02 | -0.176260 | 0.469884 |
| 2 | 2019-07-03 | -0.176260 | 0.469884 |
| 3 | 2019-07-04 | -0.157029 | 0.459783 |
| 4 | 2019-07-05 | -0.157029 | 0.459783 |
| ... | ... | ... | ... |
| 726 | 2021-06-26 | -0.214722 | 0.469884 |
| 727 | 2021-06-27 | -0.214722 | 0.469884 |
| 728 | 2021-06-28 | -0.195491 | 0.469884 |
| 729 | 2021-06-29 | -0.195491 | 0.469884 |
| 730 | 2021-06-30 | -0.195491 | 0.469884 |
731 rows × 3 columns
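The HH:MM parsing above boils down to a minutes-since-midnight conversion; here is a standalone helper for sanity-checking it (note the factor of 60, which preserves the ordering of times across hour boundaries):

```python
def hhmm_to_minutes(s):
    """Convert an 'HH:MM' string to minutes since midnight."""
    h, m = s.split(':')
    return int(h) * 60 + int(m)

# 06:00 must come out later than 05:38; a factor of 60 guarantees that
# (a wrong factor such as 24 would give 144 < 158 and break the ordering).
print(hhmm_to_minutes('05:38'))  # 338
print(hhmm_to_minutes('06:00'))  # 360
```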
sun_hours_df = normalized_df.copy()
sun_hours_df['date'] = sun_hours_df['datetime'].dt.date.astype('datetime64[ns]')
sun_hours_df = (
sun_hours_df
.merge(initial_sun_hours_df, on='date', how='left')
.drop(columns=['date'])
)
print(sun_hours_df.corr('spearman')['class'].sort_values(key=abs, ascending=False).to_string())
class        1.000000
hour        -0.118387
sunrise      0.052818
Friday       0.022682
Thursday    -0.017629
Tuesday     -0.015695
Saturday     0.013945
Sunday      -0.008499
sunset       0.005216
Wednesday    0.004536
Monday       0.000687
Surprisingly, it is sunrise that shows the much higher correlation, not sunset.
Let's see how it helps our classifiers:
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sun_hours_df.drop(columns=['sunset']))
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot
)
results_df.loc['day night best classifier'] = score['Average']
results_df
Basel      0.746934
Asuta      0.787195
Sheraton   0.739457
Dan        0.834304
Dubnov     0.938606
Habima     0.941570
Average    0.831344
| average score | |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
| advanced time norm best classifier | 0.823886 |
| weather best classifier | 0.836786 |
| day night best classifier | 0.831344 |
It seems that this new information only confuses the classifiers. Let's double check that adding sunset back doesn't help:
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sun_hours_df)
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
_ = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot
)
Basel      0.749975
Asuta      0.795808
Sheraton   0.769757
Dan        0.812975
Dubnov     0.922433
Habima     0.941296
Average    0.832041
Adding sunset back barely changes the score, and either way the sun-hours data does not improve on our earlier results, so we will leave it out.
initial_holiday_df = pd.read_csv('holidays.csv')
initial_holiday_df['date'] = initial_holiday_df.apply(lambda row: pd.Timestamp(date(year=row['year'], month=datetime.strptime(row['month'], '%b').month, day=row['day'])), axis=1)
initial_holiday_df = initial_holiday_df[['date', 'holiday_name', 'holiday_class']]
initial_holiday_df
| date | holiday_name | holiday_class | |
|---|---|---|---|
| 0 | 2019-01-21 | Tu Bishvat | no_vacation |
| 1 | 2019-03-20 | Fast of Esther | no_vacation |
| 2 | 2019-03-20 | Purim Eve | no_vacation |
| 3 | 2019-03-21 | Purim (Tel Aviv) | no_vacation |
| 4 | 2019-03-22 | Shushan Purim (Jerusalem) | no_vacation |
| ... | ... | ... | ... |
| 115 | 2021-05-10 | Jerusalem Day | no_vacation |
| 116 | 2021-05-16 | Shavuot Eve | erev_hag |
| 117 | 2021-05-17 | Shavuot | all_vacation |
| 118 | 2021-07-17 | Tisha B'Av Eve | no_vacation |
| 119 | 2021-07-18 | Tisha B'Av | no_vacation |
120 rows × 3 columns
What we have here is a list of all holidays in the relevant date range, covering a broad notion of what counts as a holiday:
- no_vacation - special days that don't make most people change their routines (e.g. Tisha B'Av, Tu Bishvat)
- mainly_school_vacation - special days on which some people, especially children, change their routines (e.g. Hanukkah, Purim)
- erev_hag - days whose morning is a vacation for some people and whose evening is a vacation for most (e.g. Erev Rosh HaShana)
- hol_hamoed - days that are a vacation for some people (e.g. Passover (Day 4))
- all_vacation - days that are a vacation for almost everybody (e.g. Rosh HaShana)
- elections - days that are supposed to happen once every four years, yet happened four times in our two years of data

Let's merge the dataframe with our main one and see how the class changes during different holidays:
holiday_df = normalized_df.copy()
holiday_df['date'] = holiday_df['datetime'].dt.date.astype('datetime64[ns]')
holiday_df = (
holiday_df
.merge(initial_holiday_df, on='date', how='left')
.drop(columns=['date', 'holiday_name'])
)
plt.rcParams["figure.figsize"] = (15,5)
_ = sns.histplot(
    data=holiday_df[~holiday_df['holiday_class'].isna()],
    x='holiday_class',
    hue='class',
    multiple='dodge',
    palette=GREEN_RED_PALETTE,
    shrink=0.7,
).set_title('Class Distribution Per Holiday Class')
Super interesting! The feature looks very indicative: erev_hag is mostly free, all_vacation and hol_hamoed are similar but a bit less so, and elections are mostly full, which is very unusual.
Let's validate our conclusions with correlation scores:
holiday_df['holiday_class'] = holiday_df['holiday_class'].fillna('no_holiday')
holiday_df = holiday_df.set_index(['lot', 'datetime'])
holiday_df = pd.get_dummies(holiday_df)
holiday_df = holiday_df.reset_index()
print(holiday_df.corr()['class'].sort_values(key=abs, ascending=False).to_string())
class                                   1.000000
hour                                   -0.118215
holiday_class_erev_hag                  0.066173
holiday_class_all_vacation              0.053475
holiday_class_no_holiday               -0.053433
holiday_class_elections                -0.043146
holiday_class_hol_hamoed                0.037790
Friday                                  0.022682
Thursday                               -0.017629
Tuesday                                -0.015695
Saturday                                0.013945
Sunday                                 -0.008499
holiday_class_mainly_school_vacation   -0.007334
Wednesday                               0.004536
Monday                                  0.000687
holiday_class_no_vacation              -0.000614
Alright, the results show what we expected, and we are ready to use them in our classification.
Let's get rid of the non-correlated ones and continue:
holiday_df = holiday_df.drop(columns=['holiday_class_mainly_school_vacation', 'holiday_class_no_vacation'])
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(holiday_df)
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot
)
results_df.loc['holidays best classifier'] = score['Average']
results_df
Basel      0.758445
Asuta      0.810096
Sheraton   0.725429
Dan        0.845124
Dubnov     0.943370
Habima     0.941570
Average    0.837339
| | average score |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
| advanced time norm best classifier | 0.823886 |
| weather best classifier | 0.836786 |
| day night best classifier | 0.831344 |
| holidays best classifier | 0.837339 |
As expected, the holiday data did help the results. Not as much as we hoped, but any accuracy increase is good news.
The next step is, obviously, to merge the best features from each dataset and see the results:
enriched_df = holiday_df.copy().merge(weather_df, on=[
'lot', 'datetime', 'class', 'hour',
'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'
])
display(enriched_df)
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(enriched_df)
| | lot | datetime | class | hour | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | ... | holiday_class_all_vacation | holiday_class_elections | holiday_class_erev_hag | holiday_class_hol_hamoed | holiday_class_no_holiday | Heat Index | Cloud Cover | Clear | Partially cloudy | Rain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asuta | 2019-07-14 00:00:00+03:00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 50.0 | 0 | 1 | 0 |
| 1 | Asuta | 2019-07-14 01:00:00+03:00 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 50.0 | 0 | 1 | 0 |
| 2 | Asuta | 2019-07-14 02:00:00+03:00 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 50.0 | 0 | 1 | 0 |
| 3 | Asuta | 2019-07-14 03:00:00+03:00 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 44.6 | 0 | 1 | 0 |
| 4 | Asuta | 2019-07-14 04:00:00+03:00 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 33.0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92263 | Sheraton | 2021-06-20 20:00:00+03:00 | 0 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92264 | Sheraton | 2021-06-20 21:00:00+03:00 | 0 | 21 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92265 | Sheraton | 2021-06-20 22:00:00+03:00 | 0 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92266 | Sheraton | 2021-06-20 23:00:00+03:00 | 1 | 23 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 0.0 | 0 | 0 | 0 |
| 92267 | Sheraton | 2021-06-21 00:00:00+03:00 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 25.9 | 0.0 | 0 | 0 | 0 |
92268 rows × 21 columns
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['merged data best classifier'] = score['Average']
results_df
Basel      0.743675
Asuta      0.803573
Sheraton   0.773567
Dan        0.826138
Dubnov     0.955028
Habima     0.941570
Average    0.840592
| | average score |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
| advanced time norm best classifier | 0.823886 |
| weather best classifier | 0.836786 |
| day night best classifier | 0.831344 |
| holidays best classifier | 0.837339 |
| merged data best classifier | 0.840592 |
The final result is certainly higher than the "initial best classifier"!
We are now ready to test our final classifier on the test data.
test_results_df = pd.DataFrame(columns=['average score'])
lot_to_best_clf = find_lot_to_best_clf(all_clf_confs)
score = evaluate_model(
model_training_func=lambda train_x, train_y, lot: lot_to_best_clf[lot][0](**lot_to_best_clf[lot][1]).fit(train_x, train_y),
model_evaluating_func=lambda model, test_x, test_y: accuracy_score(y_true=test_y, y_pred=model.predict(test_x)),
datasets_per_lot=[test_dataset_per_lot]
)
test_results_df.loc['sklearn best classifiers'] = score['Average']
test_results_df
Basel      0.781965
Asuta      0.814266
Sheraton   0.737550
Dan        0.804845
Dubnov     0.990579
Habima     0.909825
Average    0.839838
| | average score |
|---|---|
| sklearn best classifiers | 0.839838 |
These results are amazing. Not only is the average score pretty high, but it is also very consistent with our validation results, which means we didn't overfit the training and validation data.
Let's utilize Keras deep learning models to improve our prediction accuracy!
Before trying fancy deep learning models, we should start with a simple DNN that attempts to predict the parking availability status from our time features (sin/cos of day/week).
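As a reminder of what the time features in sin_normalized_df look like, here is a hedged, self-contained sketch of the sin/cos encoding (the column names and exact construction are assumptions, not the notebook's code):

```python
import numpy as np
import pandas as pd

# toy hourly timeline; in the notebook this is the real datetime column
df = pd.DataFrame({'datetime': pd.date_range('2021-01-01', periods=48, freq='h')})
seconds = (df['datetime'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
day = 24 * 60 * 60
week = 7 * day
# map time-of-day and day-of-week onto the unit circle
df['day_sin'] = np.sin(seconds * (2 * np.pi / day))
df['day_cos'] = np.cos(seconds * (2 * np.pi / day))
df['week_sin'] = np.sin(seconds * (2 * np.pi / week))
df['week_cos'] = np.cos(seconds * (2 * np.pi / week))
```

Encoding time as sin/cos pairs lets the model see 23:00 and 00:00 as close together, which a raw hour feature cannot.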
from functools import partial
def _model_training_func(train_x, train_y, lot, model_creation_func, epochs=10, patience=2,
                         loss_function=tf.losses.MeanSquaredError, metrics_function=tf.keras.metrics.BinaryAccuracy,
                         optimizer_function=tf.optimizers.Adam):
    model = model_creation_func()
    train_x = np.array(train_x, dtype=np.float32)
    train_y = np.array(train_y, dtype=np.float32)
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=patience, mode='min')
    model.compile(loss=loss_function(),
                  optimizer=optimizer_function(),
                  metrics=[metrics_function()])
    model.fit(train_x, train_y, epochs=epochs, verbose=0, callbacks=[early_stopping])
    return model
def _model_evaluating_func(model, test_x, test_y):
    test_x = np.array(test_x, dtype=np.float32)
    test_y = np.array(test_y, dtype=np.float32)
    return model.evaluate(test_x, test_y, verbose=0)[1]
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
def _create_simple_clsf_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=1)
    ])
score = evaluate_model(
model_training_func=partial(_model_training_func, model_creation_func=_create_simple_clsf_model),
model_evaluating_func=_model_evaluating_func,
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
Basel      0.665962
Asuta      0.728625
Sheraton   0.709820
Dan        0.813353
Dubnov     0.921684
Habima     0.941570
Average    0.796836
Great, we got decent prediction accuracy with the linear model. We might get better results if we add a hidden layer.
def _create_dense_clsf_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=32, activation='relu'),
        tf.keras.layers.Dense(units=1)
    ])
score = evaluate_model(
model_training_func=partial(_model_training_func, model_creation_func=_create_dense_clsf_model),
model_evaluating_func=_model_evaluating_func,
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
Basel      0.733688
Asuta      0.777998
Sheraton   0.716040
Dan        0.792270
Dubnov     0.923427
Habima     0.941570
Average    0.814166
We got some really nice results! Let's see what happens when we add more hidden layers with dropout to prevent overfitting.
def _create_dense2_clsf_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=32),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.Dropout(0.01),
        tf.keras.layers.Dense(units=32, activation='relu'),
        tf.keras.layers.Dense(units=1)
    ])
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_model_training_func, model_creation_func=_create_dense2_clsf_model),
model_evaluating_func=_model_evaluating_func,
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['keras simple dense classifier'] = score['Average']
Basel      0.737550
Asuta      0.781054
Sheraton   0.728614
Dan        0.784143
Dubnov     0.914092
Habima     0.938299
Average    0.813959
Adding more and more layers did not necessarily improve accuracy. Some lots improved while others didn't. Let's try adding more features to the model, like we did before:
sin_enriched_df = (sin_normalized_df
.merge(enriched_df, on=["datetime", "lot", "class"])
.drop(columns=['hour', 'Friday', 'Monday', 'Saturday', 'Sunday','Thursday', 'Tuesday', 'Wednesday'])
.copy())
def _create_dense3_clsf_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units=4, activation='relu'),
        tf.keras.layers.Dense(units=32, activation='relu'),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(units=32, activation='relu'),
        tf.keras.layers.Dense(units=1)
    ])
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_enriched_df)
score = evaluate_model(
model_training_func=partial(_model_training_func, model_creation_func=_create_dense3_clsf_model, epochs=20, patience=5),
model_evaluating_func=_model_evaluating_func,
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['keras enriched dense classifier'] = score['Average']
Basel      0.670514
Asuta      0.757215
Sheraton   0.712472
Dan        0.802004
Dubnov     0.908634
Habima     0.941570
Average    0.798735
Overall, it seems like adding more features confused our model. Lots like Basel and Asuta are usually not affected by weather or vacations, and we got less accurate predictions there. On the other hand, seasonal lots located near attractions, such as Sheraton, Dan and Habima, got better predictions.
Let's evaluate our best model on test data.
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_model_training_func, model_creation_func=_create_dense2_clsf_model),
model_evaluating_func=_model_evaluating_func,
datasets_per_lot=[test_dataset_per_lot],
)
test_results_df.loc['keras simple dense classifier'] = score['Average']
Basel      0.741588
Asuta      0.779273
Sheraton   0.713324
Dan        0.790040
Dubnov     0.881561
Habima     0.909825
Average    0.802602
So far we have only classified data points based on their features, without the context of their temporal order.
In the next section, we will create models that forecast new timesteps based on prior information.
The model's input will be [input_timesteps, labels+features] and its output will be [output_timesteps, labels].
For example, a model that forecasts 3 hours into the future based on 6 hours with 4 features (sin/cos of day/week) and 1 label (class) would have an input shape of [6, 5] and an output shape of [3, 1].
To build such a dataset, we will use tf.keras.preprocessing.timeseries_dataset_from_array, which breaks the data into batches of consecutive time windows of a predefined length. Then we will split the inputs from the outputs.
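Here is a minimal, self-contained illustration of tf.keras.preprocessing.timeseries_dataset_from_array on toy data, using the 6-input/3-output split from the example above:

```python
import numpy as np
import tensorflow as tf

# a toy single-feature series of 20 timesteps
data = np.arange(20, dtype=np.float32).reshape(-1, 1)
# overlapping windows of 6 input + 3 output consecutive timesteps
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data, targets=None, sequence_length=6 + 3, sequence_stride=1, batch_size=4)

def split_x_y(window):
    # first 6 steps are the input, last 3 are the target
    return window[:, :6, :], window[:, 6:, :]

ds = ds.map(split_x_y)
x, y = next(iter(ds))
print(x.shape, y.shape)  # (4, 6, 1) (4, 3, 1)
```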
def _generate_dataset(df_x, df_y, input_timesteps, output_timesteps,
                      y_labels_slice=slice(-1, None), add_y_features_to_x=False):
    # the df_x/df_y split is irrelevant here, so we rebuild the full dataframe
    df = df_x.copy()
    df['class'] = df_y
    # normalize dataframe types
    data_arr = np.array(df, dtype=np.float32)
    # create timeseries dataset
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data_arr,
        targets=None,
        sequence_stride=1,
        sequence_length=input_timesteps + output_timesteps,
        shuffle=True,
        batch_size=32
    )

    def _split_x_y(batch):
        x = batch[:, :input_timesteps, :]
        y = batch[:, input_timesteps:, y_labels_slice]
        if add_y_features_to_x and input_timesteps == output_timesteps:
            y_vec = tf.unstack(batch[:, input_timesteps:, :], axis=2)
            del y_vec[y_labels_slice]
            y_features = tf.stack(y_vec, axis=2)
            x = tf.concat([x, y_features], axis=2)
        # set shapes for clarity
        x.set_shape([None, input_timesteps, None])
        y.set_shape([None, output_timesteps, None])
        return x, y

    ds = ds.map(_split_x_y)
    return ds
def _ts_model_training_func(train_x, train_y, lot, model_creation_func, epochs=10, patience=2, input_timesteps=1, output_timesteps=1, fit_model=True,
                            loss_function=tf.losses.MeanSquaredError, metrics_function=tf.keras.metrics.BinaryAccuracy,
                            optimizer_function=tf.optimizers.Adam, y_labels_slice=slice(-1, None),
                            val_x=None, val_y=None, add_y_features_to_x=False):
    model = model_creation_func()
    dataset = _generate_dataset(train_x, train_y, input_timesteps, output_timesteps, y_labels_slice=y_labels_slice, add_y_features_to_x=add_y_features_to_x)
    callbacks = []
    dataset_val = None
    if val_x is not None:
        dataset_val = _generate_dataset(val_x, val_y,
                                        input_timesteps, output_timesteps,
                                        y_labels_slice=y_labels_slice, add_y_features_to_x=add_y_features_to_x)
        callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, mode='min')]
    model.compile(loss=loss_function(),
                  optimizer=optimizer_function(),
                  metrics=[metrics_function()])
    if fit_model:
        # pass the callbacks so early stopping actually takes effect
        model.fit(dataset, epochs=epochs, verbose=0, validation_data=dataset_val, callbacks=callbacks)
    return model
def _ts_model_evaluating_func(model, test_x, test_y,
                              input_timesteps=1, output_timesteps=1,
                              y_labels_slice=slice(-1, None), add_y_features_to_x=False):
    dataset = _generate_dataset(test_x, test_y,
                                input_timesteps, output_timesteps,
                                y_labels_slice=y_labels_slice, add_y_features_to_x=add_y_features_to_x)
    return model.evaluate(dataset, verbose=0)[1]
class BaselineRepeat(tf.keras.Model):
    def __init__(self, output_timesteps=None):
        super().__init__()
        self._output_timesteps = output_timesteps

    def call(self, inputs):
        # inputs is a tensor with shape [batch, timesteps, features]
        input_timesteps = inputs.shape[1]
        output_timesteps = self._output_timesteps or input_timesteps
        multiplier = (output_timesteps // input_timesteps) + 1
        # repeat the input along the time axis and keep only the class column
        outputs = tf.concat([inputs] * multiplier, axis=1)[:, :output_timesteps, -1:]
        return outputs

def _create_baseline_repeat(output_timesteps=None):
    return BaselineRepeat(output_timesteps)
First, let's run the repeat baseline with a 1-hour prediction: the model gets the features and label of one hour and predicts the label of the next hour.
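For the 1-hour case, the baseline's repeat-and-slice logic reduces to taking the last feature (the class) of the input hour. A toy illustration (the feature layout, with class last, is an assumption):

```python
import tensorflow as tf

# (batch=2, steps=1, features=3); the last feature plays the role of class
inputs = tf.constant([[[0.5, -0.5, 1.0]],
                      [[0.2,  0.9, 0.0]]])
# same result as BaselineRepeat's concat-then-slice for this setting
outputs = inputs[:, :1, -1:]
print(outputs.numpy().ravel())  # [1. 0.]
```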
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func, model_creation_func=_create_baseline_repeat, input_timesteps=1 ,output_timesteps=1, fit_model=False),
model_evaluating_func=partial(_ts_model_evaluating_func, output_timesteps=1),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['keras 1h prediction baseline'] = score['Average']
Basel      0.874685
Asuta      0.881424
Sheraton   0.891933
Dan        0.955687
Dubnov     0.971320
Habima     0.974988
Average    0.925006
Great! We got really nice results here. Let's see if we can push this further using a deep learning model.
linear_ts_model = None

def create_linear_ts_model():
    global linear_ts_model
    linear_ts_model = tf.keras.Sequential([
        tf.keras.layers.Dense(units=1)
    ])
    return linear_ts_model
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_linear_ts_model,
input_timesteps=1, output_timesteps=1,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=1, output_timesteps=1,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=False
)
Basel      0.862949
Asuta      0.872680
Sheraton   0.872862
Dan        0.933186
Dubnov     0.968760
Habima     0.958135
Average    0.911429
That's interesting: the simple DNN model could not beat the baseline. Our model found a local optimum that was not good enough. Let's dig into the model and see what it learned:
linear_ts_model.weights
[<tf.Variable 'dense_677/kernel:0' shape=(9, 1) dtype=float32, numpy=
array([[ 0.12353326],
[ 0.10617913],
[-0.01744407],
[-0.05675456],
[ 0.36297145],
[-0.165286 ],
[ 0.07579748],
[-0.00702313],
[ 0.06068982]], dtype=float32)>,
<tf.Variable 'dense_677/bias:0' shape=(1,) dtype=float32, numpy=array([0.48075154], dtype=float32)>]
As expected, the most significant feature used for prediction was the last hour's label (class).
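One way to read such a kernel is to zip its rows with the input-column names. A hedged sketch of the mechanics (the feature order is an assumption about the dataset layout, and the kernel here is freshly initialized rather than trained):

```python
import tensorflow as tf

# assumed layout: 5 x-features (4 sin/cos + class) followed by 4 injected
# y features, giving the 9-row kernel seen above
feature_names = ['day_sin', 'day_cos', 'week_sin', 'week_cos', 'class',
                 'y_day_sin', 'y_day_cos', 'y_week_sin', 'y_week_cos']
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1)])
model.build(input_shape=(None, 1, len(feature_names)))
# pair each kernel weight with its (assumed) input column
for name, w in zip(feature_names, model.weights[0].numpy().ravel()):
    print(f'{name:12s} {w:+.4f}')
```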
Let's see what happens when we use a different optimizer, such as SGD.
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_linear_ts_model,
input_timesteps=1 ,output_timesteps=1,
optimizer_function=tf.keras.optimizers.SGD,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=1, output_timesteps=1,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
Basel      0.874685
Asuta      0.881424
Sheraton   0.892216
Dan        0.955551
Dubnov     0.971320
Habima     0.975124
Average    0.925053
linear_ts_model.weights
[<tf.Variable 'dense_737/kernel:0' shape=(9, 1) dtype=float32, numpy=
array([[ 0.06649537],
[ 0.42063746],
[-0.00159483],
[-0.51666075],
[ 0.7462543 ],
[-0.21138157],
[-0.33588633],
[ 0.00955346],
[ 0.5046514 ]], dtype=float32)>,
<tf.Variable 'dense_737/bias:0' shape=(1,) dtype=float32, numpy=array([0.18502732], dtype=float32)>]
This time, the model scored virtually the same as our baseline. Peeking into its weights, we can see it gave the class feature more significance.
Let's try again with another hidden layer and activation.
dense_ts_model = None

def create_dense_ts_model():
    global dense_ts_model
    dense_ts_model = tf.keras.Sequential([
        tf.keras.layers.Dense(units=32, activation='relu'),
        tf.keras.layers.Dense(units=1)
    ])
    return dense_ts_model
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_dense_ts_model,
input_timesteps=1, output_timesteps=1,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=1, output_timesteps=1,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
Basel      0.872008
Asuta      0.888570
Sheraton   0.891413
Dan        0.954801
Dubnov     0.971460
Habima     0.975124
Average    0.925563
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=_create_baseline_repeat,
input_timesteps=24 ,output_timesteps=24, fit_model=False),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=24, output_timesteps=24),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION]
)
results_df.loc['keras 24h prediction baseline'] = score['Average']
Basel      0.662099
Asuta      0.717271
Sheraton   0.776665
Dan        0.888338
Dubnov     0.941868
Habima     0.923054
Average    0.818216
The baseline model reached 82% accuracy! As before, let's try to improve on it using deep learning models.
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_dense_ts_model,
input_timesteps=24, output_timesteps=24,
optimizer_function=tf.keras.optimizers.SGD, epochs=20,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=24, output_timesteps=24,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
results_df.loc['keras 24h prediction dense'] = score['Average']
Basel      0.741921
Asuta      0.792630
Sheraton   0.794947
Dan        0.884861
Dubnov     0.948757
Habima     0.947362
Average    0.851746
This is great! Even though our model is still pretty basic, we reached pretty decent accuracy!
Next, we will try to improve accuracy by granting our model access to the features of multiple timesteps at once.
The input of our model is a 3D tensor with shape [batch, timesteps, features]. Dense layers operate only on the 3rd axis, so different timesteps are transformed separately.
We can use Flatten or Conv1D layers to operate on multiple timesteps simultaneously.
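A quick self-contained check of that claim: a Dense layer applies the same kernel to every timestep independently and leaves the timestep axis untouched.

```python
import tensorflow as tf

# Dense acts only on the last (feature) axis of a 3D input
layer = tf.keras.layers.Dense(units=2)
out = layer(tf.ones([8, 24, 5]))  # (batch, timesteps, features)
print(out.shape)  # (8, 24, 2): all 24 timesteps preserved
```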
def create_ts_conv_model():
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(256, activation='relu', kernel_size=24),
        tf.keras.layers.Dense(24, kernel_initializer=tf.initializers.zeros()),
        tf.keras.layers.Reshape([24, 1]),
    ])
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_ts_conv_model,
input_timesteps=24, output_timesteps=24,
optimizer_function=tf.keras.optimizers.SGD, epochs=20,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=24, output_timesteps=24,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
results_df.loc['keras 24h prediction cnn'] = score['Average']
Basel      0.717740
Asuta      0.769135
Sheraton   0.800671
Dan        0.885167
Dubnov     0.950658
Habima     0.948454
Average    0.845304
Adding a CNN layer did not improve our accuracy at all. Let's try another approach and use an LSTM layer, which learns "trends" with fewer trainable variables.
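A rough, self-contained parameter count supports the "fewer trainable variables" claim, using shapes similar to the models above (treating 9 input features, 5 x-features plus 4 injected y-features, as an assumption):

```python
import tensorflow as tf

# Conv1D spanning the whole 24-step window, as in create_ts_conv_model
conv = tf.keras.layers.Conv1D(256, kernel_size=24)
conv.build((None, 24, 9))
# 32-unit LSTM, as in create_ts_lstm_model
lstm = tf.keras.layers.LSTM(32)
lstm.build((None, 24, 9))
print(conv.count_params(), lstm.count_params())  # 55552 5376
```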
def create_ts_lstm_model():
    return tf.keras.Sequential([
        tf.keras.layers.LSTM(32, return_sequences=False),
        tf.keras.layers.Dense(24, kernel_initializer=tf.initializers.zeros()),
        tf.keras.layers.Reshape([24, 1]),
    ])
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_ts_lstm_model,
input_timesteps=24, output_timesteps=24,
optimizer_function=tf.keras.optimizers.SGD, epochs=20,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=24, output_timesteps=24,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
Basel      0.674594
Asuta      0.745931
Sheraton   0.773412
Dan        0.881486
Dubnov     0.935278
Habima     0.949014
Average    0.826619
Still not improving. The LSTM layer may obscure the y features (sin/cos of day/week) we injected into our dataset. To overcome this issue, we can come up with a non-sequential model that feeds the following Dense layer with the original y-features input.
To do so, we will create a custom class that derives from keras' Model and implements our logic.
class LstmWithYFeatures(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.lstm = tf.keras.layers.LSTM(32, return_sequences=False)
        self.dense = tf.keras.layers.Dense(32, activation='relu')
        self.dense_output = tf.keras.layers.Dense(1)

    def call(self, inputs, training=None):
        # extract only x-features
        x = inputs[:, :, :5]
        y_features = inputs[:, :, 5:]
        # (batch, steps, x_features) => (batch, lstm_units)
        lstm_units = self.lstm(x)
        # (batch, lstm_units) => (batch, 1, lstm_units)
        lstm_units = lstm_units[:, tf.newaxis, :]
        # (batch, 1, lstm_units) => (batch, steps, lstm_units)
        lstm_units = tf.concat([lstm_units] * 24, axis=1)
        # (batch, steps, lstm_units) => (batch, steps, lstm_units+y_features)
        lstm_with_y_features = tf.concat([lstm_units, y_features], axis=2)
        # (batch, steps, lstm_units+y_features) => (batch, steps, dense_units)
        predictions = self.dense(lstm_with_y_features)
        # (batch, steps, dense_units) => (batch, steps, y_label)
        predictions = self.dense_output(predictions)
        return predictions

def create_lstm2_model():
    return LstmWithYFeatures()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(sin_normalized_df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_lstm2_model,
input_timesteps=24, output_timesteps=24,
optimizer_function=tf.keras.optimizers.SGD, epochs=20,
add_y_features_to_x=True
),
model_evaluating_func=partial(_ts_model_evaluating_func,
input_timesteps=24, output_timesteps=24,
add_y_features_to_x=True),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
results_df.loc['keras 24h prediction lstm'] = score['Average']
Basel      0.755854
Asuta      0.788649
Sheraton   0.803709
Dan        0.896902
Dubnov     0.950410
Habima     0.947914
Average    0.857240
We managed to improve our 24h prediction accuracy even more! Let's see how we're doing so far:
with pd.option_context("display.max_rows", 50):
    display(results_df.sort_values(by='average score'))
| | average score |
|---|---|
| initial logistic regression | 0.755704 |
| keras enriched dense classifier | 0.798735 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| keras simple dense classifier | 0.813959 |
| keras 24h prediction baseline | 0.818216 |
| advanced time norm best classifier | 0.823886 |
| day night best classifier | 0.831344 |
| initial best classifier | 0.832545 |
| weather best classifier | 0.836786 |
| holidays best classifier | 0.837339 |
| merged data best classifier | 0.840592 |
| keras 24h prediction cnn | 0.845304 |
| keras 24h prediction dense | 0.851746 |
| keras 24h prediction lstm | 0.857240 |
| keras 1h prediction baseline | 0.925006 |
In the previous section, we used single-shot models that predicted multiple steps simultaneously.
Another approach we can try is to create a self-feeding model that predicts multiple steps, but one step at a time.
For example, we can write a simple autoregressive (AR) model with the following logic:
class DenseAR(tf.keras.Model):
    def __init__(self, units, in_steps, out_steps):
        super().__init__()
        self.in_steps = in_steps
        self.out_steps = out_steps
        self.units = units
        self.conv = tf.keras.layers.Conv1D(units, activation='relu', kernel_size=in_steps)
        self.dense = tf.keras.layers.Dense(5)
        self._y_features = None

    def _extract_time_features(self, data):
        # this time we inject the relevant y features directly from the y dataset;
        # the previous models expected them to be injected into the inputs up front
        _, y, _ = tf.keras.utils.unpack_x_y_sample_weight(data)
        self._y_features = y[:, :, :-1]
        return data

    def train_step(self, data):
        self._extract_time_features(data)
        return super().train_step(data)

    def predict_step(self, data):
        self._extract_time_features(data)
        return super().predict_step(data)

    def test_step(self, data):
        self._extract_time_features(data)
        return super().test_step(data)

    def call(self, inputs, training=None, y_features=None):
        if y_features is not None:
            self._y_features = y_features[:, :, :-1]
        predictions = inputs
        # run the prediction steps one at a time
        for n in range(self.out_steps):
            # (batch, steps, features) => (batch, 1, conv_units)
            prediction = self.conv(predictions)
            # (batch, 1, conv_units) => (batch, 1, y_predicted_features + y_predicted_labels)
            prediction = self.dense(prediction)
            # keep only the predicted label and attach the known y features
            prediction = tf.concat([self._y_features[:, n, :], prediction[:, 0, -1:]], axis=1)
            # drop the oldest timestep and append the new prediction
            predictions = tf.concat(
                (predictions[:, 1:, :], prediction[:, tf.newaxis, :]),
                axis=1
            )
        return predictions
Since our model returns features and labels (instead of just the single label, class), we need to customize our metrics and loss functions to ignore everything but the class label.
# note: despite its name, this loss wraps MeanAbsoluteError; both classes
# score only the last channel (the class label)
class LabelMSE(tf.losses.MeanAbsoluteError):
    def call(self, y_true, y_pred):
        y_true = y_true[:, :, -1]
        y_pred = y_pred[:, :, -1]
        return super().call(y_true, y_pred)

class LabelBinaryAccuracy(tf.keras.metrics.Accuracy):
    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = y_true[:, :, -1]
        y_pred = y_pred[:, :, -1]
        y_true = tf.greater_equal(y_true, 0.5)
        y_pred = tf.greater_equal(y_pred, 0.5)
        return super().update_state(y_true, y_pred, sample_weight)
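A standalone sanity check of the masking idea behind these classes, on toy tensors (not the notebook's data): threshold only the last channel and compare.

```python
import tensorflow as tf

# (batch, steps, features): junk values fill the non-label channels
y_true = tf.constant([[[0.0, 1.0], [0.0, 0.0]]])
y_pred = tf.constant([[[9.9, 0.8], [9.9, 0.4]]])
acc = tf.keras.metrics.Accuracy()
# slice out the last channel and binarize, as LabelBinaryAccuracy does
acc.update_state(tf.greater_equal(y_true[:, :, -1], 0.5),
                 tf.greater_equal(y_pred[:, :, -1], 0.5))
print(acc.result().numpy())  # 1.0: both steps match on the label channel
```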
def create_ar_model():
    return DenseAR(128, 24, 24)
df = sin_normalized_df.copy()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_ar_model,
input_timesteps=24 ,output_timesteps=24, epochs=20,
loss_function=LabelMSE, metrics_function=LabelBinaryAccuracy,
y_labels_slice=slice(None)),
model_evaluating_func=partial(_ts_model_evaluating_func, input_timesteps=24, output_timesteps=24, y_labels_slice=slice(None)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
results_df.loc['keras 24h prediction AR dense'] = score['Average']
Basel      0.768687
Asuta      0.814380
Sheraton   0.830807
Dan        0.912346
Dubnov     0.956796
Habima     0.948850
Average    0.871978
Amazing! We pushed further and are now above 87%. This score seems promising, so we will evaluate it on the test data:
df = sin_normalized_df.copy()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_ar_model,
input_timesteps=24 ,output_timesteps=24, epochs=20,
loss_function=LabelMSE, metrics_function=LabelBinaryAccuracy,
y_labels_slice=slice(None)),
model_evaluating_func=partial(_ts_model_evaluating_func, input_timesteps=24, output_timesteps=24, y_labels_slice=slice(None)),
datasets_per_lot=[test_dataset_per_lot]
)
test_results_df.loc['keras 24h prediction AR dense'] = score['Average']
Basel      0.760656
Asuta      0.818068
Sheraton   0.845366
Dan        0.799210
Dubnov     1.000000
Habima     0.909004
Average    0.855384
Next, we will try an LSTM model with the same autoregressive approach. We will add a warm-up step to initialize the LSTM state, and then continue by feeding the model its own predictions.
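The autoregressive rollout itself is independent of the layer type. As a minimal pure-Python sketch of the idea (a toy recurrence for illustration, not the Keras model below):

```python
def autoregressive_rollout(warmup_inputs, step_fn, out_steps):
    """Roll a one-step model forward by feeding back its own predictions.

    warmup_inputs: observed values used to build up the internal state.
    step_fn: maps (previous value, state) -> (prediction, new state).
    """
    # Warm-up: run the observed inputs through the step function,
    # keeping only the final prediction and state.
    state = 0.0
    for x in warmup_inputs:
        pred, state = step_fn(x, state)
    predictions = [pred]
    # Autoregression: feed each prediction back in as the next input.
    for _ in range(out_steps - 1):
        pred, state = step_fn(pred, state)
        predictions.append(pred)
    return predictions

# Toy step function: an exponential moving average acting as the "state".
def ema_step(x, state, alpha=0.5):
    new_state = alpha * x + (1 - alpha) * state
    return new_state, new_state

preds = autoregressive_rollout([1.0, 1.0, 1.0], ema_step, out_steps=3)
```

The LSTM version below follows the same two-phase structure: `warmup` plays the role of the first loop, and `call` continues with the feedback loop.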
class LstmAR(tf.keras.Model):
def __init__(self, units, out_steps):
super().__init__()
self.out_steps = out_steps
self.units = units
self.lstm_cell = tf.keras.layers.LSTMCell(units)
self.lstm_rnn = tf.keras.layers.RNN(self.lstm_cell, return_state=True)
self.dense2 = tf.keras.layers.Dense(32, activation='relu')
self.dense = tf.keras.layers.Dense(5)
self._y_features = None
def _extract_time_features(self, data):
_, y, _ = tf.keras.utils.unpack_x_y_sample_weight(data)
self._y_features = y[:, :, :-1]
return data
def train_step(self, data):
self._extract_time_features(data)
return super().train_step(data)
def predict_step(self, data):
self._extract_time_features(data)
return super().predict_step(data)
def test_step(self, data):
self._extract_time_features(data)
return super().test_step(data)
def warmup(self, inputs):
# (batch, steps, features) => (batch, lstm_units), (batch, states)
x, *state = self.lstm_rnn(inputs)
prediction = self.dense2(x)
prediction = self.dense(prediction)
return prediction, state
def call(self, inputs, training=None, y_features=None):
if y_features is not None:
self._y_features = y_features[:, :, :-1]
predictions = []
prediction, state = self.warmup(inputs)
prediction = prediction[:,-1:]
prediction = tf.concat([self._y_features[:,0,:], prediction], axis=1)
# Insert the first prediction
predictions.append(prediction)
# Run the rest of the prediction steps
for n in range(1, self.out_steps):
# Use the last prediction as input.
x = prediction
# Execute one lstm step.
x, state = self.lstm_cell(x, states=state,
training=training)
prediction = self.dense2(x)
prediction = self.dense(prediction)
prediction = prediction[:, -1:]
prediction = tf.concat([self._y_features[:,n,:], prediction], axis=1)
# Add the prediction to the output
predictions.append(prediction)
# predictions.shape => (time, batch, features)
predictions = tf.stack(predictions)
# predictions.shape => (batch, time, features)
predictions = tf.transpose(predictions, [1, 0, 2])
return predictions
def create_lstm_ar_model():
return LstmAR(32, 24)
df = sin_normalized_df.copy()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_lstm_ar_model,
input_timesteps=24, output_timesteps=24, epochs=20,
loss_function=LabelMSE, metrics_function=LabelBinaryAccuracy,
y_labels_slice=slice(None)),
model_evaluating_func=partial(_ts_model_evaluating_func, input_timesteps=24, output_timesteps=24, y_labels_slice=slice(None)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
| Basel | Asuta | Sheraton | Dan | Dubnov | Habima | Average |
|---|---|---|---|---|---|---|
| 0.760442 | 0.813321 | 0.816478 | 0.913845 | 0.955721 | 0.948954 | 0.868127 |
df = sin_normalized_df.copy()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_lstm_ar_model,
input_timesteps=24, output_timesteps=24, epochs=20,
loss_function=LabelMSE, metrics_function=LabelBinaryAccuracy,
y_labels_slice=slice(None)),
model_evaluating_func=partial(_ts_model_evaluating_func, input_timesteps=24, output_timesteps=24, y_labels_slice=slice(None)),
datasets_per_lot=[test_dataset_per_lot]
)
| Basel | Asuta | Sheraton | Dan | Dubnov | Habima | Average |
|---|---|---|---|---|---|---|
| 0.803580 | 0.830041 | 0.844588 | 0.801305 | 1.000000 | 0.909842 | 0.864893 |
As we saw before, the LSTM could potentially obscure important known features such as the y-features (day/week sin/cos), so we can also feed them directly into the dense layer.
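The workaround is just a concatenation along the feature axis before the dense head. A minimal NumPy sketch of the shapes involved (the dimensions here are illustrative, not the model's actual sizes):

```python
import numpy as np

# Hidden state from one LSTM step, and the known time features
# (day/week sin/cos) for that same step.
batch, units, n_time_features = 2, 32, 4
lstm_output = np.zeros((batch, units))
time_features = np.ones((batch, n_time_features))

# Mirrors tf.concat([y_features, x], axis=1): the dense head now sees
# the time features directly, alongside the LSTM state.
dense_input = np.concatenate([time_features, lstm_output], axis=1)
print(dense_input.shape)  # (2, 36)
```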
class LstmAR2(tf.keras.Model):
def __init__(self, units, out_steps):
super().__init__()
self.out_steps = out_steps
self.units = units
self.lstm_cell = tf.keras.layers.LSTMCell(units)
# Also wrap the LSTMCell in an RNN to simplify the `warmup` method.
self.lstm_rnn = tf.keras.layers.RNN(self.lstm_cell, return_state=True)
self.dense2 = tf.keras.layers.Dense(32, activation='relu')
self.dense = tf.keras.layers.Dense(5)
self._y_features = None
def _extract_time_features(self, data):
_, y, _ = tf.keras.utils.unpack_x_y_sample_weight(data)
self._y_features = y[:, :, :-1]
return data
def train_step(self, data):
self._extract_time_features(data)
return super().train_step(data)
def predict_step(self, data):
self._extract_time_features(data)
return super().predict_step(data)
def test_step(self, data):
self._extract_time_features(data)
return super().test_step(data)
def warmup(self, inputs):
# (batch, steps, features) => (batch, lstm_units), (batch, states)
x, *state = self.lstm_rnn(inputs)
# (batch, features) => (batch, y_features+features)
x = tf.concat([self._y_features[:,0,:], x], axis=1)
prediction = self.dense2(x)
prediction = self.dense(prediction)
return prediction, state
def call(self, inputs, training=None, y_features=None):
if y_features is not None:
self._y_features = y_features[:, :, :-1]
# Collect the dynamically unrolled outputs in a Python list.
predictions = []
# Initialize the lstm state
prediction, state = self.warmup(inputs)
prediction = prediction[:,-1:]
prediction = tf.concat([self._y_features[:,0,:], prediction], axis=1)
# Insert the first prediction
predictions.append(prediction)
# Run the rest of the prediction steps
for n in range(1, self.out_steps):
# Use the last prediction as input.
x = prediction
# Execute one lstm step.
x, state = self.lstm_cell(x, states=state,
training=training)
# Convert the lstm output to a prediction.
x = tf.concat([self._y_features[:,n,:], x], axis=1)
prediction = self.dense2(x)
prediction = self.dense(prediction)
prediction = prediction[:, -1:]
prediction = tf.concat([self._y_features[:,n,:], prediction], axis=1)
# Add the prediction to the output
predictions.append(prediction)
# predictions.shape => (time, batch, features)
predictions = tf.stack(predictions)
# predictions.shape => (batch, time, features)
predictions = tf.transpose(predictions, [1, 0, 2])
return predictions
def create_lstm2_ar_model():
return LstmAR2(32, 24)
df = sin_normalized_df.copy()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_lstm2_ar_model,
input_timesteps=24, output_timesteps=24, epochs=20,
loss_function=LabelMSE, metrics_function=LabelBinaryAccuracy,
y_labels_slice=slice(None)),
model_evaluating_func=partial(_ts_model_evaluating_func, input_timesteps=24, output_timesteps=24, y_labels_slice=slice(None)),
datasets_per_lot=train_and_validation_datasets_per_lot,
detailed=True,
row_names=[f'{m}/{y}' for m, y in MONTHS_AND_YEARS_FOR_TRAIN_AND_VALIDATION],
use_test_as_val=True
)
results_df.loc['keras 24h prediction AR LSTM'] = score['Average']
| Basel | Asuta | Sheraton | Dan | Dubnov | Habima | Average |
|---|---|---|---|---|---|---|
| 0.763749 | 0.807700 | 0.810242 | 0.906523 | 0.951637 | 0.949086 | 0.864823 |
We improved a little, but are still behind the dense AR model. Let's evaluate this model on the test set.
df = sin_normalized_df.copy()
train_and_validation_datasets_per_lot, test_dataset_per_lot = create_train_validation_test_datasets(df)
score = evaluate_model(
model_training_func=partial(_ts_model_training_func,
model_creation_func=create_lstm2_ar_model,
input_timesteps=24, output_timesteps=24, epochs=20,
loss_function=LabelMSE, metrics_function=LabelBinaryAccuracy,
y_labels_slice=slice(None)),
model_evaluating_func=partial(_ts_model_evaluating_func, input_timesteps=24, output_timesteps=24, y_labels_slice=slice(None)),
datasets_per_lot=[test_dataset_per_lot]
)
test_results_df.loc['keras 24h prediction AR LSTM'] = score['Average']
| Basel | Asuta | Sheraton | Dan | Dubnov | Habima | Average |
|---|---|---|---|---|---|---|
| 0.800407 | 0.830879 | 0.830999 | 0.792445 | 0.985752 | 0.909662 | 0.858357 |
with pd.option_context("display.max_rows", 50):
display(results_df)
| | average score |
|---|---|
| initial logistic regression | 0.755704 |
| initial decision tree | 0.806975 |
| initial random forest | 0.808463 |
| initial best classifier | 0.832545 |
| advanced time norm best classifier | 0.823886 |
| weather best classifier | 0.836786 |
| day night best classifier | 0.831344 |
| holidays best classifier | 0.837339 |
| merged data best classifier | 0.840592 |
| keras simple dense classifier | 0.813959 |
| keras enriched dense classifier | 0.798735 |
| keras 1h prediction baseline | 0.925006 |
| keras 24h prediction baseline | 0.818216 |
| keras 24h prediction dense | 0.851746 |
| keras 24h prediction cnn | 0.845304 |
| keras 24h prediction lstm | 0.857240 |
| keras 24h prediction AR dense | 0.871978 |
| keras 24h prediction AR LSTM | 0.864823 |
with pd.option_context("display.max_rows", 50):
display(test_results_df)
| | average score |
|---|---|
| sklearn best classifiers | 0.839838 |
| keras simple dense classifier | 0.802602 |
| keras 24h prediction AR dense | 0.855384 |
| keras 24h prediction AR LSTM | 0.858357 |
We can compare our predictions to the true values to find special patterns and anomalies, which could help us improve in the future.
First, let's take the best sklearn classifiers we found before and train them on all our data.
_, test_dataset_per_lot = create_train_validation_test_datasets(enriched_df)
classic_models = {lot: model[0](**model[1]) for lot, model in lot_to_best_clf.items()}
for lot, lot_df in tqdm(test_dataset_per_lot.items()):
x, y = split_df_to_x_y(lot_df[0])
classic_models[lot].fit(x, y)
Then, we can create a special dataframe with a success column.
predict_df = dict()
for lot_name, lot_df in enriched_df.groupby('lot'):
df = lot_df.copy()
x, y = split_df_to_x_y(lot_df.drop(columns=['lot', 'datetime']))
y_predicts = classic_models[lot_name].predict(x)
df['prediction'] = y_predicts
df.loc[df['prediction'] == df['class'], 'success'] = 1
df.loc[df['prediction'] != df['class'], 'success'] = 0
df['date'] = df['datetime'].dt.normalize()
df['week_first_day'] = (
((df['date'] - FIRST_DAY_OF_FIRST_WHOLE_WEEK).dt.days / 7)
.astype(int)
.apply(lambda week_idx: FIRST_DAY_OF_FIRST_WHOLE_WEEK + pd.to_timedelta(timedelta(weeks=week_idx)))
)
df['day_hour'] = df['date'].dt.day_name() + '-' + df['datetime'].dt.hour.astype(str).str.zfill(2)
predict_df[lot_name] = df
predict_with_lot_name = predict_df.copy()
for lot in LOT_NAMES:
predict_with_lot_name[lot]['lot'] = lot
predict_with_lot_name[lot]
predict_with_lot_name = pd.concat(predict_with_lot_name.values())
predict_with_lot_name
| lot | datetime | class | hour | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | ... | Heat Index | Cloud Cover | Clear | Partially cloudy | Rain | prediction | success | date | week_first_day | day_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Asuta | 2019-07-14 00:00:00+03:00 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 50.0 | 0 | 1 | 0 | 0 | 1.0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-00 |
| 1 | Asuta | 2019-07-14 01:00:00+03:00 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 50.0 | 0 | 1 | 0 | 0 | 1.0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-01 |
| 2 | Asuta | 2019-07-14 02:00:00+03:00 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 50.0 | 0 | 1 | 0 | 0 | 1.0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-02 |
| 3 | Asuta | 2019-07-14 03:00:00+03:00 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 44.6 | 0 | 1 | 0 | 0 | 1.0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-03 |
| 4 | Asuta | 2019-07-14 04:00:00+03:00 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 33.0 | 0 | 1 | 0 | 0 | 1.0 | 2019-07-14 00:00:00+03:00 | 2019-07-14 00:00:00+03:00 | Sunday-04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92263 | Sheraton | 2021-06-20 20:00:00+03:00 | 0 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 0.0 | 0 | 0 | 0 | 1 | 0.0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-20 |
| 92264 | Sheraton | 2021-06-20 21:00:00+03:00 | 0 | 21 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 0.0 | 0 | 0 | 0 | 1 | 0.0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-21 |
| 92265 | Sheraton | 2021-06-20 22:00:00+03:00 | 0 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 0.0 | 0 | 0 | 0 | 1 | 0.0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-22 |
| 92266 | Sheraton | 2021-06-20 23:00:00+03:00 | 1 | 23 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 25.9 | 0.0 | 0 | 0 | 0 | 1 | 1.0 | 2021-06-20 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Sunday-23 |
| 92267 | Sheraton | 2021-06-21 00:00:00+03:00 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 25.9 | 0.0 | 0 | 0 | 0 | 1 | 1.0 | 2021-06-21 00:00:00+03:00 | 2021-06-20 00:00:00+03:00 | Monday-00 |
92268 rows × 26 columns
Just like we did at the beginning, we can use our data exploration visualizations to learn about our accuracy.
fig, axes = plt.subplots(len(LOT_NAMES), 1, figsize=(15, 8 * len(LOT_NAMES)), sharey=True)
fig.suptitle('Model Accuracy Per Hour', fontsize=22, y=0.895)
plt.subplots_adjust(hspace=0.4)
for i, lot in enumerate(LOT_NAMES):
df = predict_df[lot]
ax = axes[i]
plot = sns.scatterplot(
ax=ax,
palette=GREEN_RED_PALETTE,
legend=False,
x='day_hour',
y='week_first_day',
hue='success',
s=10,
data=df
)
ax.set_title(lot)
for ax in axes:
for i, label in enumerate(ax.get_xticklabels()):
if i % 4 == 0:
label.set_rotation(90)
else:
label.set_visible(False)
On most of the parking lots, we can spot patterns of missed predictions. The pattern seems to repeat on certain hours and days.
Let's continue with our exploration and look for weekly patterns.
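To quantify the weakest slots rather than eyeball them, one can rank the `day_hour` buckets by mean success. A sketch on a toy frame (the real analysis would use `predict_with_lot_name`, which has the same columns):

```python
import pandas as pd

# Toy stand-in for predict_with_lot_name: one row per hourly prediction,
# with a binary 'success' column.
toy = pd.DataFrame({
    'day_hour': ['Sunday-08', 'Sunday-08', 'Sunday-17', 'Sunday-17', 'Monday-08'],
    'success':  [0, 0, 1, 1, 1],
})

# Mean accuracy per slot, worst first -- the slots worth investigating.
worst_slots = (
    toy.groupby('day_hour')['success']
       .mean()
       .sort_values()
)
print(worst_slots.head(3))
```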
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Accuracy Per Hour In The Week', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
ax=ax,
x='day_hour',
y='success',
data=predict_with_lot_name
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
ax = axes[i + 1]
g = sns.lineplot(
ax=ax,
x='day_hour',
y='success',
data=predict_with_lot_name[predict_with_lot_name['lot'] == lot]
)
ax.set_title(lot)
for ax in axes:
ax.set_ylim(bottom=0.5, top=1)
for i, label in enumerate(ax.get_xticklabels()):
if i % 8 == 0:
label.set_rotation(90)
else:
label.set_visible(False)
Interesting. We definitely see some hourly patterns here! Let's zoom in and group by the hour.
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Accuracy Per Hour', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
ax=ax,
x='hour',
y='success',
data=predict_with_lot_name
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
ax = axes[i + 1]
g = sns.lineplot(
ax=ax,
x='hour',
y='success',
data=predict_with_lot_name[predict_with_lot_name['lot'] == lot]
)
ax.set_title(lot)
for ax in axes:
ax.set_ylim(bottom=0.5, top=1)
for i, label in enumerate(ax.get_xticklabels()):
label.set_rotation(90)
Now we've got it! We can clearly see the unique patterns each lot has.
Let's see if we can spot any anomalies over time.
weekly_predict_df = (
predict_with_lot_name
.groupby(by=['week_first_day', 'lot'])
.agg({'success': 'mean'})
.reset_index()
)
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Accuracy Per Week Over Time', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
ax=ax,
x='week_first_day',
marker='o',
y='success',
data=weekly_predict_df
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
ax = axes[i + 1]
sns.lineplot(
ax=ax,
x='week_first_day',
marker='o',
y='success',
data=weekly_predict_df[weekly_predict_df['lot'] == lot]
)
ax.set_title(lot)
for ax in axes:
ax.set_ylim(bottom=0, top=1)
for i, label in enumerate(ax.get_xticklabels()):
label.set_rotation(90)
weekly_predict_df[weekly_predict_df['lot'] == 'Basel'].sort_values(by=['success']).head(10)
| week_first_day | lot | success | |
|---|---|---|---|
| 97 | 2019-11-02 23:00:00+02:00 | Basel | 0.565476 |
| 415 | 2020-12-19 23:00:00+02:00 | Basel | 0.607143 |
| 247 | 2020-04-26 00:00:00+03:00 | Basel | 0.614286 |
| 223 | 2020-03-29 00:00:00+03:00 | Basel | 0.628205 |
| 451 | 2021-01-30 23:00:00+02:00 | Basel | 0.630952 |
| 241 | 2020-04-19 00:00:00+03:00 | Basel | 0.642857 |
| 217 | 2020-03-21 23:00:00+02:00 | Basel | 0.658683 |
| 103 | 2019-11-09 23:00:00+02:00 | Basel | 0.672619 |
| 499 | 2021-03-28 00:00:00+03:00 | Basel | 0.672619 |
| 457 | 2021-02-06 23:00:00+02:00 | Basel | 0.678571 |
We can find some interesting anomalies in SOME of the data. For example, the week of 21/03/2020 was the beginning of the first lockdown.
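Flagging such weeks can be automated by thresholding the weekly accuracy and annotating the hits against a list of known events. A sketch (the event labels and the 0.70 threshold are illustrative assumptions; the real analysis would use `weekly_predict_df` and the actual lockdown dates):

```python
import pandas as pd

# Hypothetical map of known special-event weeks (assumption for illustration).
known_events = {pd.Timestamp('2020-03-21'): 'first lockdown begins'}

# Toy weekly accuracies, shaped like weekly_predict_df for a single lot.
weekly = pd.DataFrame({
    'week_first_day': pd.to_datetime(['2020-03-14', '2020-03-21', '2020-03-28']),
    'success': [0.84, 0.66, 0.63],
})

# Flag weeks whose accuracy drops below the threshold, and annotate
# each one with a known event where available.
threshold = 0.70
anomalies = weekly[weekly['success'] < threshold].copy()
anomalies['event'] = anomalies['week_first_day'].map(known_events).fillna('unknown')
print(anomalies)
```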
We can sift out minor anomalies by cross-checking our DNN models against sklearn's models.
First, we need to train and evaluate our DNN model on all our data.
def create_ar_model():
return DenseAR(128, 24, 24)
def _split_x_y(batch):
x = batch[:, :24, :]
y = batch[:, 24:, :]
x.set_shape([None, 24, None])
y.set_shape([None, 24, None])
return x, y
models = {}
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2, mode='min')
for lot_name, lot_df in tqdm(sin_normalized_df.groupby('lot')):
lot_df = lot_df[['day sin', 'day cos', 'week sin', 'week cos', 'class']]
arr = np.array(lot_df, dtype=np.float32)
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
arr,
targets=None,
sequence_stride=1,
sequence_length=48,
shuffle=True,
batch_size=32
)
ds = ds.map(_split_x_y)
model = create_ar_model()
model.compile(loss=LabelMSE(),
optimizer=tf.optimizers.Adam(),
metrics=[LabelBinaryAccuracy()])
model.fit(ds, epochs=20, verbose=1, callbacks=[early_stopping])
models[lot_name] = model
[Training logs abridged: six per-lot models, each trained for up to 20 epochs with early stopping (~480 steps, ~3s per epoch); final training accuracies ranged from roughly 0.84 to 0.93.]
We can create a unified dataframe with sklearn's predictions and see where both models failed.
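The "both models failed" check reduces to an OR over the two per-sample success flags, which is what the `any_success` column computes. As a minimal sketch:

```python
# Per-sample success flags from the two models (toy values).
success_sklearn = [1, 0, 1, 0]
success_dnn     = [1, 1, 0, 0]

# any_success is 1 when at least one model got the sample right;
# it is 0 only where both models missed.
any_success = [int(a + b > 0) for a, b in zip(success_sklearn, success_dnn)]
combined_accuracy = sum(any_success) / len(any_success)
print(any_success, combined_accuracy)  # [1, 1, 1, 0] 0.75
```

The remaining zeros are exactly the samples neither modeling approach can explain, which makes them good candidates for true anomalies.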
predict_df2 = dict()
for lot_name, lot_df in tqdm(sin_normalized_df.groupby('lot')):
lot_df = lot_df[['day sin', 'day cos', 'week sin', 'week cos', 'class', 'datetime']]
ts = lot_df.pop('datetime').view('int64')
arr = np.array(lot_df, dtype=np.float32)
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
arr,
targets=None,
sequence_stride=1,
sequence_length=48,
shuffle=False, # Do not mix up batches
batch_size=32
)
datetime_arr = np.array(pd.DataFrame(ts, columns=['datetime']), dtype=np.int64)
datetime_ds = tf.keras.preprocessing.timeseries_dataset_from_array(
datetime_arr,
targets=None,
sequence_stride=1,
sequence_length=48,
shuffle=False, # Do not mix up batches
batch_size=32
)
batches = list()
for dt in datetime_ds:
batches.append(dt[:, 24:, :])
batch_to_datetime = tf.concat(batches, axis=0)
ds = ds.map(_split_x_y)
model = models[lot_name]
predictions = model.predict(ds)
predictions = tf.concat([predictions, batch_to_datetime], axis=2)
data = tf.concat(tf.unstack(predictions, axis=0), axis=0).numpy()
df = pd.DataFrame(data, columns=['day sin', 'day cos', 'week sin', 'week cos', 'prediction_dnn', 'datetime_s'])
df['prediction_dnn'] = (df['prediction_dnn'] >= 0.5).astype(np.float32)
df = df[['prediction_dnn', 'datetime_s']]
df = df.groupby(by=['datetime_s']).agg({'prediction_dnn': 'mean'}).reset_index()
df['prediction_dnn'] = (df['prediction_dnn'] >= 0.5).astype(np.float32)
df2 = predict_df[lot_name].copy()
df2['datetime_s'] = df2['datetime'].view('int64')
df = df.merge(df2, on=['datetime_s'], how='inner')
df.loc[df['prediction_dnn'] == df['class'], 'success_dnn'] = 1
df.loc[df['prediction_dnn'] != df['class'], 'success_dnn'] = 0
df['date'] = df['datetime'].dt.normalize()
df['week_first_day'] = (
((df['date'] - FIRST_DAY_OF_FIRST_WHOLE_WEEK).dt.days / 7)
.astype(int)
.apply(lambda week_idx: FIRST_DAY_OF_FIRST_WHOLE_WEEK + pd.to_timedelta(timedelta(weeks=week_idx)))
)
df['day_hour'] = df['date'].dt.day_name() + '-' + df['datetime'].dt.hour.astype(str).str.zfill(2)
df['any_success'] = (df['success'] + df['success_dnn'] > 0).astype(np.float32)
df.pop('datetime_s')
predict_df2[lot_name] = df
predict2_with_lot_name = predict_df2.copy()
for lot in LOT_NAMES:
predict2_with_lot_name[lot]['lot'] = lot
predict2_with_lot_name[lot]
predict2_with_lot_name = pd.concat(predict2_with_lot_name.values())
As before, let's run the same visualizations, this time on any_success, which should give better results.
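For reference, any_success is just the union of the two models' per-timestamp correctness flags: a row counts as a success if either the classifier or the DNN was right. A tiny illustration with made-up flags:

```python
import numpy as np
import pandas as pd

# Hypothetical correctness flags for the classifier and the DNN.
df = pd.DataFrame({
    'success': [1.0, 0.0, 0.0, 1.0],
    'success_dnn': [0.0, 1.0, 0.0, 1.0],
})

# A timestamp counts as predicted if either model got it right.
df['any_success'] = (df['success'] + df['success_dnn'] > 0).astype(np.float32)

print(df['any_success'].tolist())  # [1.0, 1.0, 0.0, 1.0]
```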
fig, axes = plt.subplots(len(LOT_NAMES), 1, figsize=(15, 8 * len(LOT_NAMES)), sharey=True)
fig.suptitle('Model Accuracy Per Hour', fontsize=22, y=0.895)
plt.subplots_adjust(hspace=0.4)
for i, lot in enumerate(LOT_NAMES):
    df = predict_df2[lot]
    ax = axes[i]
    plot = sns.scatterplot(
        ax=ax,
        palette=GREEN_RED_PALETTE,
        legend=False,
        x='day_hour',
        y='week_first_day',
        hue='any_success',
        s=10,
        data=df
    )
    ax.set_title(lot)
for ax in axes:
    for i, label in enumerate(ax.get_xticklabels()):
        if i % 4 == 0:
            label.set_rotation(90)
        else:
            label.set_visible(False)
Great, we can see that there is less random noise, and we can definitely spot the existence of patterns. Now let's focus on weekly and hourly patterns:
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Accuracy Per Hour In The Week', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
    axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
    ax=ax,
    x='day_hour',
    y='any_success',
    data=predict2_with_lot_name
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
    ax = axes[i + 1]
    g = sns.lineplot(
        ax=ax,
        x='day_hour',
        y='any_success',
        data=predict2_with_lot_name[predict2_with_lot_name['lot'] == lot]
    )
    ax.set_title(lot)
for ax in axes:
    ax.set_ylim(bottom=0.5, top=1)
    for i, label in enumerate(ax.get_xticklabels()):
        if i % 8 == 0:
            label.set_rotation(90)
        else:
            label.set_visible(False)
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Accuracy Per Hour', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
    axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
    ax=ax,
    x='hour',
    y='any_success',
    data=predict2_with_lot_name
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
    ax = axes[i + 1]
    g = sns.lineplot(
        ax=ax,
        x='hour',
        y='any_success',
        data=predict2_with_lot_name[predict2_with_lot_name['lot'] == lot]
    )
    ax.set_title(lot)
for ax in axes:
    ax.set_ylim(bottom=0.5, top=1)
    for i, label in enumerate(ax.get_xticklabels()):
        label.set_rotation(90)
Here we go! This time we can see that Basel and Asuta were harder to predict, specifically very early in the morning, when someone leaves their parking spot while there is barely any traffic, and around 6-7pm, when people come back from work and queues form at the parking lots.
weekly_predict2_df = (
    predict2_with_lot_name
    .groupby(by=['week_first_day', 'lot'])
    .agg({'any_success': 'mean'})
    .reset_index()
)
fig = plt.figure(constrained_layout=True, figsize=(13, 8))
fig.suptitle('Average Accuracy Per Week Over Time', fontsize=22)
gs = fig.add_gridspec(3, 3)
axes = [fig.add_subplot(gs[0, :])]
for i in range(len(LOT_NAMES)):
    axes.append(fig.add_subplot(gs[1 + i // 3, i % 3]))
ax = axes[0]
sns.lineplot(
    ax=ax,
    x='week_first_day',
    marker='o',
    y='any_success',
    data=weekly_predict2_df
)
ax.set_title('All Lots Combined')
for i, lot in enumerate(LOT_NAMES):
    ax = axes[i + 1]
    sns.lineplot(
        ax=ax,
        x='week_first_day',
        marker='o',
        y='any_success',
        data=weekly_predict2_df[weekly_predict2_df['lot'] == lot]
    )
    ax.set_title(lot)
for ax in axes:
    ax.set_ylim(bottom=0.7, top=1)
    for i, label in enumerate(ax.get_xticklabels()):
        label.set_rotation(90)
Over time, we can spot some anomalies, usually around holidays. Moreover, we can see spikes in uncertainty around April 2020, September 2020 and December 2020 - months with heavy COVID-19 restrictions!
We had a great time working on this interesting project! Every aspect of it taught us new skills. We practiced data visualization, worked with classifiers and tweaked deep learning models. We also learned techniques to cross-validate our results to reduce bias.
Even though we decided to end this project here, we believe there are many directions left to explore. We might be able to find correlations with other data, such as NLP analysis of news. There is also much room for improvement in our models - we only tested a few configurations, and not in the most computationally efficient way. We could also push our anomaly detection further and build second-order models that predict anomalies using our trained models!
Since we're staying in Tel-Aviv, we will keep collecting data and continue this project in our free time. We already have almost two years of data, and it would be worth revisiting once we have even more.